Preface



The Australian Council for Educational Research was established in 
1930, under a grant from the Carnegie Corporation of New York. Three 
functions were held in 1980 to mark the fiftieth anniversary. Two were 
invitational seminars — one on the improvement of measurement in 
education and psychology and another on societal change and its impact
on education. The third function was the presentation of a history of the 
ACER prepared by Professor W. F. Connell. 

This volume contains the papers and reactant statements which were 
presented at the Invitational Seminar on the Improvement of Measure- 
ment in Education and Psychology. A seminar on this topic was con- 
sidered to be highly appropriate for the anniversary celebrations, as 
measurement in education and psychology has been one of the main 
areas of the work of the ACER since its inception. 

The seminar was held on 22 and 23 May in the Council Chamber of the 
University of Melbourne. Sixty-one people attended, including par- 
ticipants from most parts of Australia and from Canada, The People's 
Republic of China, Finland, Germany (FRG), Great Britain, New 
Zealand, and the United States. A highlight of the occasion was the 
presence of Emeritus Professor R. L. Thorndike, of Teachers College, 
Columbia University, who was especially invited by the ACER to give 
the opening paper. His visit to Australia was supported by the Australian 
American Educational Foundation. 

The seminar was opened by the President of the ACER, Emeritus Pro- 
fessor P. H. Karmel, whose introductory statement is included in this 
volume. 

It was decided in the planning stages of the seminar that the focus 
should be on the contribution that latent trait measures can make to 
education and psychology. In the 1960s and 1970s, psychometricians had 
devoted much effort to the development of latent trait measurement
models. Yet measurement procedures based upon these models had been 
used for only a short time in the practice of educational and
psychological measurement in Australia. Many practitioners in the field 
of measurement still had little or no knowledge of the features of the 
various latent trait models. It was thought that the time was opportune to 
bring some of the theoretical and practical aspects of latent trait
measurement procedures to the attention of people in Australia working
in or interested in the field of measurement in education and psychology.

Papers on various aspects of latent trait models were sought from ap- 
propriate authors in Australia and overseas. These were circulated in ad- 
vance to participants in the seminar. The seminar itself took the form of 
a paper presentation, followed by a reactant statement on the paper, 
followed in turn by general discussion. The edited versions of the papers 
appear in this volume, together with the statements of reactants. Some of 
the discussion is caught up in the final paper, which represents the chair- 
man's attempt to summarize the debate emerging from the seminar. 

The seminar was undoubtedly successful in raising the level of 
awareness of many of the participants about theoretical and practical 
issues in measurement and particularly in latent trait measurement pro- 
cedures. Since the papers represent original contributions by reputable 
authors in the field, the ACER believes that they deserve a much wider 
audience. 

July 1981 Donald Spearritt

Vice-President 

Australian Council for Educational Research 
Introductory Statement

Peter Karmel 



I have been asked, as President of the ACER, to open this seminar. 
Although I have no technical background in educational measurement, I 
am a consumer of educational measurements and, with my background 
as an economist, I have a predilection for measuring things which is 
similar to yours. I like to quantify attributes and concepts whether they 
are in economics or education or psychology, and my work in educa- 
tional policy has usually had a statistical basis. Because of this I am well 
aware of the conflict one faces when one is trying to measure things. On 
the one hand there is the need to be precise and to emphasize what is 
measurable; on the other hand, in respect of policy questions, it is important
to maintain a healthy scepticism about what is measurable and
about the kinds of inferences that can be made from measurements and 
mathematical models. 

This seminar is the first of three major functions arranged to mark the 
fiftieth anniversary of the establishment of the ACER. The organization 
was set up in 1930 through a grant from the Carnegie Corporation of 
New York. It is therefore most appropriate that the first paper should be 
given by a distinguished scholar from the United States. While this 
seminar is of a highly technical kind, being devoted to improving 
measurement in education and psychology, a second seminar on Societal 
Change and its Impact on Education is concerned with educational 
policy in Australia for the remainder of the twentieth century. The two 
topics provide a nice balance between the interests of the ACER in 
measurement and its technicalities on the one hand, and in educational 
policies in a changing social context on the other. The third function,
during the Annual Meeting of the Council in October, is the presentation 
to the Council of a history of the ACER written by W. F. Connell. 

It is relevant on occasions like this to issue a warning about divergencies
between technical measurement and policy prescriptions derived
from measurement. Some of the measures we use are relatively simple, 
such as the number of students in the last year of high school or the 
number of staff in institutions of higher education. But many of our
measures, and particularly those which will be discussed in this seminar,
are statistical constructs which do not precisely portray the concept that 
they are intended to measure. Since they cannot be used in any simple 
way, care has to be taken in drawing inferences from them. In
economics, for example, we talk about the price level, and in Australia
we have the consumer price index (CPI), which governs to a large extent
the rate of change of wages in this country. The CPI is a statistical construct
which does not measure changes in the price level according to any
theoretical definition. Similar difficulties arise with measures of productivity
or of the real value of the gross domestic product.

A second matter of concern is that the use of statistical constructs may 
have unintended results in the practical situations in which they are ap- 
plied. Measures of educational attainment, for instance, may have conse- 
quences of a kind which are not the concern of those who are interested 
in their technical development. For example, they may lead to the labelling
of children with particular kinds of problems, and this in turn can
have implications for the way in which different groups in the community
are treated, as in some of the Title I programs in the USA.

Thirdly, theoretical and mathematical models may be developed which
include concepts of which statistical constructs are measures. Inferences
may be drawn from these models and the statistical estimation of their
parameters. It is easy to slide into a position where policy prescriptions
are being made about a real world which is in fact very different from the
theoretical world. In the field of economics, for example, it has become 
common to measure outcomes against the optimum properties of a world 
in which there is a free market and free competition. The underlying 
model, however, rests on a whole series of value assumptions about what 
is optimum. Further, the optimum properties of that kind of world hold
only if the world is made up of a large number of small units. But we 
know that the world we live in is not made up of a large number of small 
units; it is made up of quite large aggregations of economic power — large 
corporations, trade unions, various pressure groups, and so on. To slide 
from the kinds of policies that one would advocate in a theoretical model 
to advocating those kinds of policies in the real world is simply not 
legitimate. My intuition suggests that the difficulties experienced in applying
theoretical models in economics to policy questions in the real
world are likely to occur also with respect to models relating to educational
attainment or psychological testing or measurement.

While it is important to keep these points in mind, it is the purpose of
this first seminar to consider some difficult technical problems in the field
of measurement in education and psychology. This is a very proper way
to begin the celebration of the fiftieth birthday of the ACER, given the
role that measurement has played in its activities over the past 50 years. 





The Improvement of Measurement in Education and Psychology
Edited by Donald Spearritt
Copyright © ACER 1982



1 

Educational Measurement— Theory 
and Practice 

Robert L. Thorndike 

I am honoured to open the program of this seminar celebrating the 50th 
anniversary of the Australian Council for Educational Research. It
represents a major milestone in the history of this distinguished
organization. 

We could also, perhaps, claim to be celebrating the 75th anniversary of
the birth of psychometric theory and practice, for it was just 75 years ago
that Binet and Simon published their report of the first workable scale to
assess intelligence. It was just 75 years ago, give or take a year, that
Charles Spearman published his model of intellectual ability expressed in
terms of g, a general intellectual factor, together with s factors each
specific to a single test, and also the model of test score in terms of true
score and error that provided the foundation of much of classical test
theory. In the same year, in the United States, Edward Thorndike's Mental
and Social Measurements introduced basic statistical concepts to the
educational profession. We are still a relatively young discipline, when
compared with the total range of disciplined inquiry, but we have moved
far enough along so that we can well afford to pause to consider where we
have been, where we are, and where we may be going.

Educational and psychological measurement have, since their inception,
involved the parallel streams of practical test development and formulation
of theoretical models of test performance. Binet was a
pragmatist, assembling in his scale of intelligence a wide variety of tasks
that he found useful in the practical task of differentiating those who 
were making normal school progress from those who were having 
difficulty in school. His work had little self-conscious theoretical structure.
Concurrently Spearman was offering the theory that would provide
a rationale for Binet's procedure, in that it postulated a pervasive general
factor, g, extending through the whole range of cognitive tasks.
Throughout the course of the past 75 years these two facets of the enterprise
have continued side by side. As the practical test makers have
generated test exercises, set scoring procedures, and combined the results 
into test scores, the test theorists have constructed models to account for 
how people perform on a test and to explicate what the test scores 
signify. We are never interested in the bits of behaviour that appear on a
test solely for themselves, but always as signifying something more 
general, more lasting, more fundamental about the individual. 

Models have, from the beginning, been developed to account for two 
distinct, but not unrelated, aspects of test performance. On the one 
hand, it was observed that two measurements designed to be as nearly as 
possible measures of the same identical characteristic of a person did not 
yield identical scores. A theory was required of measurement error. On 
the other hand, it was observed that measurements of what were designed
to be different characteristics of a person were usually not independent,
but were in varying degree related. A theory was required to
picture the organization of human traits.

The classic theory of measurement error, or test reliability, presented
in its essentials by Spearman 75 years ago, viewed a test score as made up 
of two components, a 'true score' and an error. The true score and error 
were conceived of as completely independent. The true score was viewed 
as unchanging from one form of a test to a parallel alternate form and 
from one occasion to another. The error was considered to be unique to 
the specific measurement, and to be entirely independent of the error that 
might equally be expected to appear on another measurement designed to
assess that same characteristic of a person. Of course, the true score, as 
such, could never be directly observed. Its existence and properties could 
only be inferred from consistency of performance from one test exercise
to another or from one test score to another.
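
In symbols, the classic model just described can be sketched as follows. The notation is an assumption made here for illustration and is not taken from the paper: X is the observed score, T the true score, E the error, and E' the error on a second, parallel measurement.

```latex
\begin{align*}
X &= T + E, & \operatorname{Cov}(T, E) &= 0, & \operatorname{Cov}(E, E') &= 0, \\
\sigma_X^2 &= \sigma_T^2 + \sigma_E^2, & \rho_{XX'} &= \frac{\sigma_T^2}{\sigma_X^2}.
\end{align*}
```

Under these assumptions the reliability coefficient, the correlation between two parallel measurements, equals the proportion of observed-score variance that is true-score variance.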

This classic model of true score and error dominated the conception of 
test scores for most of the first 50 years of psychometric theory. It was a
productive model in that it led to the formulation of a number of useful
relationships. Very early it produced a statement of the relationship between
test precision and test length. It permitted the development of
estimates of the precision of difference scores and change scores, bringing
out their fragile and undependable nature, and at the same time it
permitted estimation of the properties of composites of two or more
different measures. It led to estimates of the degree to which indices of
relationship between measures of different attributes are attenuated by
the error of measurement in each. 
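
Three of the relationships mentioned in this paragraph have well-known textbook forms, sketched below in Python. The function names are hypothetical, and the difference-score formula assumes the two measures have equal variances; the paper itself does not state these formulas.

```python
import math


def spearman_brown(reliability, length_factor):
    """Projected reliability when a test is lengthened by length_factor
    using parallel material (the Spearman-Brown relationship)."""
    k = length_factor
    return k * reliability / (1 + (k - 1) * reliability)


def difference_score_reliability(r_xx, r_yy, r_xy):
    """Reliability of the difference X - Y.

    Assumes X and Y have equal variances. r_xx and r_yy are the
    reliabilities of the two measures, r_xy their intercorrelation.
    """
    return ((r_xx + r_yy) / 2 - r_xy) / (1 - r_xy)


def disattenuated_correlation(r_xy, r_xx, r_yy):
    """Estimated correlation between true scores, correcting the
    observed correlation for error of measurement in each measure."""
    return r_xy / math.sqrt(r_xx * r_yy)


if __name__ == "__main__":
    print(spearman_brown(0.5, 2))                        # about 0.667
    print(difference_score_reliability(0.8, 0.8, 0.6))   # about 0.5
    print(disattenuated_correlation(0.6, 0.8, 0.8))      # about 0.75
```

The second call illustrates the fragility of difference scores: two tests with reliability 0.8 that correlate 0.6 yield a difference score with reliability of only about 0.5.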

However, at the same time, this model generated certain problems. 
Thus, when two forms of a test yielded noticeably different mean scores
for a group, which one should be considered to correspond to the true 
score? Or when a test form yielded higher scores if given second in a 
sequence of two testings than when it was the first test, which should be
considered the true score: when given first or when given second?
Again, if different data-gathering procedures (split-test, alternate form,
retest) gave different values for the reliability coefficient, which one correctly
identified the 'true' proportion of true-score variance?

So a somewhat different model has emerged in the past 25 years. This
views any given test performance as being a sample from a universe of
behaviour defined in a particular way, and thinks of true score as being
the complete universe score that would be approached if the sample
could be increased without limit. The difference may seem a somewhat
trivial one, but it does make it easier to recognize that we may define
different universes to which we may wish to generalize, and that different
procedures for collecting test reliability data do imply different universes.
As a corollary, it leads us to recognize that 'error' is not a monolithic entity,
but involves a number of distinguishable components arising from
different sources that can be separately identified and separately
analysed. This leads to a components-of-variance model for test scores
that encourages us to tease out the several sources of variance, other than
between-persons variance, to estimate the magnitude of each, and to
plan a data-collecting strategy for future research that will hold these unwanted
sources of variance to a minimum. This view of the
generalizability of test scores has been most systematically explored by
Cronbach and his associates in their 1972 book, The Dependability of
Behavioural Measurements.
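
As a minimal sketch of what a components-of-variance analysis can look like, the Python below handles the simplest case: every person answers every item, with one score per cell. The design, the function names, and the use of expected mean squares from a two-way analysis of variance are assumptions made for illustration; they are not taken from the paper or from Cronbach's book.

```python
def variance_components(scores):
    """Estimate variance components for a persons x items table.

    scores[p][i] is the score of person p on item i (one score per cell).
    Returns estimates for persons, items, and the residual
    (person-by-item interaction confounded with error).
    Estimates can come out negative in small samples.
    """
    n_p = len(scores)
    n_i = len(scores[0])
    grand = sum(sum(row) for row in scores) / (n_p * n_i)
    person_means = [sum(row) / n_i for row in scores]
    item_means = [sum(row[i] for row in scores) / n_p for i in range(n_i)]

    # Mean squares from the two-way layout.
    ms_person = n_i * sum((m - grand) ** 2 for m in person_means) / (n_p - 1)
    ms_item = n_p * sum((m - grand) ** 2 for m in item_means) / (n_i - 1)
    ss_resid = sum(
        (scores[p][i] - person_means[p] - item_means[i] + grand) ** 2
        for p in range(n_p)
        for i in range(n_i)
    )
    ms_resid = ss_resid / ((n_p - 1) * (n_i - 1))

    return {
        "person": (ms_person - ms_resid) / n_i,
        "item": (ms_item - ms_resid) / n_p,
        "residual": ms_resid,
    }


def generalizability_coefficient(components, n_items):
    """Share of variance in an n_items-item mean score due to persons,
    when scores are used only to compare persons with one another."""
    person = components["person"]
    return person / (person + components["residual"] / n_items)


if __name__ == "__main__":
    table = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
    parts = variance_components(table)
    print(parts)
    print(generalizability_coefficient(parts, 3))  # about 0.75
```

Changing `n_items` in the last call shows how the planned length of a future test affects the dependability of its scores, which is the kind of data-collecting decision the paragraph refers to.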

At the same time that a model was being developed for test measurement
error, models were also being formulated for what was being
measured by true score on tests of different but related functions. The initial
formulation was Charles Spearman's, as was the initial formulation
of true score and error. It was phrased in the now long-familiar terms of
general ability, g, pervading a whole set of cognitive measures, and a
specific factor, s, for each separate measure. This conception of the
nature of abilities proved less durable than the true-score-and-error
model because the accumulation of empirical data soon showed that it
was an over-simplification of the manner in which abilities are organized,
and that tests are tied together much more complexly than by a single
general factor. Over the past 50 years, many different models have been
proposed to account for the observed correlations among test scores.
Some have postulated a number of distinct general ability factors. Some
have called for the inclusion of group factors, less pervasive than the
general ones. Some have introduced second-order factors to make it
possible to admit general factors that were not independent but rather
were related to each other.

A massive body of statistical theory and computational technique has
developed to service the analysis of voluminous sets of test score data
that have been gathered to elucidate the nature and relationships of
human abilities. Yet the general finding remains that a wide variety of
cognitive performances are in fact all related; the view persists that a
model must provide some role for a general component of ability; and we
can still view tests such as the Binet and its many descendants as devices
to estimate an individual's status on one widely relevant cognitive ability.
This is not the place to examine in any further detail the structure and
substance of different factor analytic models that use the interrelationships
of different tests as evidence on which to build a theory of human
abilities. Rather, let us turn to the scores on a single test, and to the
responses to a single test exercise, and see by what model these may be
understood.

The early tests that aspired to measure human abilities were composed 
largely of exercises that required the examinee to produce or construct 
the responses. Responses were varied, and each was then scored using
some scoring guide that indicated the degree of acceptability of different 
possible responses. Exercises were developed primarily on the basis of 
editorial judgment, and little by way of psychometric theory or statistical 
analysis was applied to the selection of single test exercises or the combination
of them into a test.

However, at the time of and shortly after World War I, there emerged
in the United States an enthusiasm for selective-response items: true-false
or multiple-choice exercises in which the examinee was required to
select from among those that were presented to him the correct or best
answer. In part because of the pre-established keying of the items, in part
because of the susceptibility of such items to ambiguity because of unskilled
drafting by the item's author, resulting in unintended interpretation
by the examinees of one or more of the response alternatives,
preliminary try-out and statistical analysis of the test items became the
accepted practice. During the 1920s and 30s a great diversity of statistical
indices was proposed to express an item's effectiveness in differentiating
the more from the less capable examinees, capable in terms of what the
set of items was intended to measure. By 1940, procedures had pretty
much stabilized on the correlation, biserial or point-biserial, between
item and a total score designed to represent the attribute to be measured
by the test. Facility, as represented by percentage of correct responses,
and discrimination, as represented by correlation with total score,
became standard working criteria for evaluation and selection of test
items. Psychometric theory was extended to formalize the relationship
between item parameters and test parameters, showing how test
parameters can be controlled by a judicious selection of test items. This
state of the art prevailed from the time the American Council on Education
published in 1951, under E. F. Lindquist's editorship, the first
edition of the handbook entitled Educational Measurement up until
about the time that I 'rode herd' on the second edition, which came out in
1971.
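
The two working criteria named in this paragraph, facility and discrimination, can be computed from a table of right/wrong responses as in the sketch below. The function names are hypothetical. Discrimination is taken here as the point-biserial, that is, the ordinary product-moment correlation between the 0/1 item score and the total score, and the total includes the item itself; both choices are assumptions made for this sketch.

```python
import math


def pearson(x, y):
    """Product-moment correlation between two equal-length lists.
    Raises ZeroDivisionError if either list has no variance."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def item_analysis(responses):
    """responses[p][i] is 1 if person p answered item i correctly, else 0.

    Returns one dict per item with its facility (proportion correct)
    and its discrimination (point-biserial with the total score).
    """
    totals = [sum(row) for row in responses]
    results = []
    for i in range(len(responses[0])):
        item = [row[i] for row in responses]
        results.append({
            "facility": sum(item) / len(item),
            "discrimination": pearson(item, totals),
        })
    return results


if __name__ == "__main__":
    data = [[1, 1, 1], [1, 1, 0], [1, 0, 0], [0, 0, 0]]
    for number, stats in enumerate(item_analysis(data), start=1):
        print(number, stats)
```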

At the theoretical level, the conception of the nature and function of a 
test was continuously evolving. As I view the scene, there are at the present
time two major competing models to describe what a test score
represents, and a number of variations of one of these. The two major
models may be designated the 'domain sampling' and the 'latent trait'
model. Let us take a look at each of these in turn.

As its label suggests, the domain sampling model starts out with the
assumption that there is some definable domain of knowledge,
understanding, or skill. In its clearest form, the domain is limited in
scope and is precisely defined. Such a domain could be comprised of
something like the 100 basic multiplication combinations, or the use of
commas in series, or the capitalization of proper names. The domain is
viewed primarily as having a certain horizontal extent, with minimal attention
to its vertical dimension. The exercises in a test are viewed as
being a sample from the defined domain, usually a random sample,
though sometimes a stratified sample. The appropriate inference from a
test score is considered to relate to the proportion of all the test exercises
that might be drawn from that domain on which the individual would be
expected to succeed: the individual's completeness of mastery of the
domain.
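
For a domain that really can be enumerated, the model's logic is easy to sketch in Python. The example below uses the 100 basic multiplication combinations; the pupil, the sample size, and all function names are hypothetical and serve only to show the inference from a sampled test score to mastery of the whole domain.

```python
import random

# The full domain: all 100 products of two single-digit numbers (0 to 9).
DOMAIN = [(a, b) for a in range(10) for b in range(10)]


def proportion_correct(items, knows):
    """Proportion of the given items that the pupil answers correctly."""
    return sum(1 for item in items if knows(item)) / len(items)


def draw_test(domain, n_items, seed=None):
    """Draw a simple random sample of n_items exercises, without replacement."""
    return random.Random(seed).sample(domain, n_items)


def pupil_knows(item):
    """Hypothetical pupil: correct only when both factors are 6 or less."""
    a, b = item
    return a <= 6 and b <= 6


if __name__ == "__main__":
    test = draw_test(DOMAIN, 20, seed=1)
    estimate = proportion_correct(test, pupil_knows)
    mastery = proportion_correct(DOMAIN, pupil_knows)
    print("estimated from 20 sampled exercises:", estimate)
    print("true proportion of the domain:", mastery)
```

The test score is only an estimate of the domain proportion and varies from sample to sample; the comparison is possible here only because the domain is small enough to list in full.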

This model has had a good deal of popularity, in the United States at
least, over the past 20 years. Tests based on the model are often called
'criterion-referenced tests', to contrast them with traditional norm-referenced
tests. When used most appropriately, each of these tests does
focus on a highly specific domain, and gives information that is immediately
relevant to a specific instructional decision. Has the pupil, or
the class, reached a level of proficiency on Skill A that means that further 
teaching of that skill is not required and that Skill A is available as a 
foundation for teaching Skill B? In practice, however, the model has
been called upon to rationalize much more general assessment and 
evaluation purposes, such as evaluation of the strengths and weaknesses 
of a pupil's development or of a school's curriculum. Because of its wide
current use in present day educational measurement, it is important that
we examine this model to see what its assumptions are and when and to
what extent they are justified. 

The basic assumption of any sampling enterprise, whether the universe
sampled is one of persons or one of test exercises, is that one can define a
universe of objects or events with sufficient completeness and precision
so that one can decide in each instance whether a specimen is a member
of the defined universe and whether a set of specimens constitutes a
representative sample from that universe. The safe procedure for drawing
a representative sample is to enumerate all the specimens that comprise
the universe and to use some system of randomization to draw a
sample from among the enumerated specimens. Clearly the segments of
educational achievement for which such an unambiguous definition,
complete enumeration, and random sampling is possible are relatively
few and of relatively limited importance.

Perhaps we can agree that a universe specified as 'Presented with two
single-digit numbers in the format 2 × 3 and asked to give the product,
gives the correct answer' does consist of 100 enumerable elements, and
that some system of random numbers could satisfactorily be applied to
draw a random or a stratified sample from this universe. For how many
of the significant competencies that education strives to produce is such a
procedure possible? We may feel that there does exist such a domain as
'ability to spell' or 'knowledge of biology', but how practical is it to
specify the boundaries of such a domain or to enumerate the elements
that fall within it? If we cannot specify the boundaries of the domain,
how can we be sure that the sample of tasks included in our test covers
the full scope of the domain? If we cannot enumerate the elements that
fall within the domain, how can we know whether we have sampled randomly
from them? And if we are not able to specify the limits of the
domain, in what sense is it meaningful to say that a pupil has mastery of
it? Within narrow limits it may be reasonable to say that, as of this Friday,
Mary shows mastery of the 20 words in this week's spelling lesson,
or that, tested at some particular time, John shows mastery of subtraction
problems with zero in the minuend. But, for most of the range of
significant educational achievements and for any tests that are thought to
assess general abilities, the assumptions of a domain sampling model are
met at best only roughly and approximately.

We do, of course, prepare course outlines and syllabuses that sketch in 
broad outline the content and objectives of a program of study. Long 
before educators began talking about criterion referenced as opposed to 
norm referenced tests, makers of educational achievement tests used
such outlines to guide them in planning the coverage of their in- 
struments. In that sense, there is no conflict between the old-line norm- 
referenced and new-style criterion-referenced approach. The issue is 
whether the domain to be tested is sufficiently describable and specifiable
for one to be able to assert that it has been sampled in toto in a representative
way, and so that some percentage of test items answered correctly
can be considered to represent effective mastery of the domain. For much
of what we are interested in assessing, this does not seem to be the case. 

The alternative to a model based on a domain with lateral extent is one
that focuses on a trait dimension, perceived as primarily vertical. In this
model, the function of a test is conceived to be to estimate an individual's 
location on that vertical dimension — not 'How many?' from a domain of 
tasks, but "How much?' on a dimension representing a trait. The trait or 
attribute is a construct rather than an observable. From the psychometric 
point of view the attribute assessed by a test need not be psychologically 
simple, though from the psychological viewpoint this would make for 
clarity. The test tasks may involve quite a complex of functions, but the 
latent trait model assumes that, to a reasonable approximation, the com- 
plex is the same for all the test exercises that make up the test. 

The vertical latent trait model seems most obviously appropriate for 
test tasks that vary widely in difficulty but relatively little in kind, for example,
responding to analogies of increasing subtlety, comprehending
reading passages of increasing complexity, remembering lists of increasing
length. For such attributes one tends not to worry very much about
providing precisely defined horizontal boundaries to the trait. Rather the
trait is roughly defined by the content and form of the tasks, and perhaps
by the factor structure of the task as its relationships to other types of
test exercises are studied by factor analytic procedures. Nor are there
definable vertical boundaries, because it is usually not possible to specify
limits on the facility of the task at its easiest or the difficulty of the task at 
its most difficult level. A test's function then is to locate each person on a 
vertical scale of indefinite extent in relation to anchor points provided by 
tasks scaled for their difficulty or by the performance of his fellows. 

When the focus of our concern is educational achievement, the latent 
trait model is vaguely unsatisfying, because it seems somewhat in- 
congruous to speak of a trait of, for example, competence in history. The 
unifying attribute seems to belong to the domain of knowledge rather 
than to the individual examinee. But usually there are no sharp boundaries
to the domain: test exercises relating to it do differ substantially in
difficulty or, at least, in the probability that students will succeed with 
them; it is possible to arrange people on a continuum on the basis of their 
ability to succeed with tasks drawn from the domain and to arrange the 
tasks from the domain in a continuum with respect to the likelihood that 
a person will succeed with that task. So a dimension of 'competence in
history' is perhaps one on which different individuals can be placed at
different levels. Thus, in broad-gauge measures of achievement, as well 
as in measures of aptitude (and interest or attitude), it seems plausible to 
think in terms of a dimension of performance upon which a particular in- 
dividual may be high or low. 

We turn now to alternate models for thinking of a test, together with 
the items that compose it, as a device for locating individuals on the con- 
tinuous scale of some latent attribute, where we think of the scale as 
representing degree or level of that attribute. Our attention must focus
first on our model of a test item. 

Whereas in the domain sampling model we viewed the test exercise as
one element drawn from that domain, and passing the test item as 
evidence increasing the proportion of the tasks in that domain that the 
examinee was considered to have mastered, we now view the test exercise 
as providing a cue as to where on the scale of the latent attribute the individual
falls. If he passes the item, the chances are that he falls above
the difficulty level that is exemplified by the item. If he fails the item, the
chances are that he falls below that level. But it is only a matter of probability,
because our model indicates that the likelihood of success on the
item increases gradually and continuously as we move up the scale of the
latent attribute. Our model specifies a probability function that is a continuous
function of the latent attribute, typically the cumulative normal
ogive or the logistic, two curves that have nearly identical properties.

Different items may differ in one or all of three parameters that
describe these functions. These are, respectively: 

1 a parameter that represents the steepness of the function, the rate at 
which the probability of success increases as one goes up the scale of the 
attribute; 

2 a parameter that specifies the location of the function's point of
inflexion in relation to the scale of the attribute, representing the difficulty
of the item, the level at which just half the examinees are deemed
to know the answer;

3 a parameter that represents the lower asymptote for the item, the
probability of success for persons at very low levels on the latent 
attribute. 
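
The three parameters listed above can be made concrete with the logistic form of the item function. The sketch below is an illustration only: the parameter names are hypothetical, and it uses the plain logistic curve without the scaling constant that is sometimes added to bring it closer to the normal ogive.

```python
import math


def item_characteristic(theta, steepness, difficulty, lower_asymptote=0.0):
    """Probability of success on an item for a person at level theta.

    steepness        rate at which the probability rises with theta
    difficulty       location of the point of inflexion on the theta scale
    lower_asymptote  probability of success at very low levels of theta
    """
    logistic = 1.0 / (1.0 + math.exp(-steepness * (theta - difficulty)))
    return lower_asymptote + (1.0 - lower_asymptote) * logistic


if __name__ == "__main__":
    # A person located exactly at the item's difficulty level.
    print(item_characteristic(0.0, 1.0, 0.0))        # 0.5
    print(item_characteristic(0.0, 1.0, 0.0, 0.2))   # about 0.6
    # A person far below the item's difficulty level.
    print(item_characteristic(-20.0, 1.0, 0.0, 0.2)) # close to 0.2
```

With a lower asymptote of zero the probability at the point of inflexion is one half. With a non-zero lower asymptote c it rises to (1 + c) / 2, because some of the successes at that level are attributed to guessing rather than to knowing the answer.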

Current expositions of latent trait theory usually adopt one of two
contrasting positions with respect to the role of these three parameters. 
One school of thought, represented by Rasch and his followers, elects to 
assume that the steepness parameter may be considered to be uniform 
over all items and that the lower asymptote may uniformly be considered 
to be zero. Thus items are deemed to differ only with respect to their 
difliciiliv. Then one need only estimate item dilliculties, that is, where the 
infl'xtion point lor the item falls on the scale of the latent attribute, to 
characieri/e that item fully. The scale values of the items in a set then 
enable one to estimate scale values tor each possible total score based on 
that set of items. The contrasting school of thought, led by F red Lord at 
the I ducational Testing Service of Princeion, New .^ersey, would con- 
tend that one should undertake to estimate all three parameters for each 
item, and should use the full set of item characteristic curves to estimate 
the scale values corresponding to different scores, and consequently to 



lulucational Measurement - Theory and Practice 



11 



ihc locaiion of ditVorcni individuals, l.ci us examine some of the virtues 
and some of the shortcomings of each of these approaches. 

The Rasch one-parameter model certainly has the advantage of
simplicity. A first approximation to the necessary scale values can easily
be calculated with a hand calculator. With only a single parameter to be
estimated for each item, the estimates have some prospect of stability
even with pupil samples of modest size, which is to say in the hundreds
rather than in the thousands. As Ben Wright has pointed out, in scoring
the typical test we do act as if all the items had a common steepness
parameter, that is, the same correlation with the underlying attribute.
We act this way in that we combine items with equal weights and do not
try to give greater weight to the more discriminating items, those items
with greater item-trait correlations. Furthermore, as the process of item
selection during test construction has ordinarily weeded out those items
with definitely low item-trait correlations, the ones that will show a much
flatter item characteristic function, the range of item-trait correlations is
reduced a great deal in those items that survive preliminary screening and
make it to the final form of the test. Perhaps we do not strain reality too
much if we assume that the slopes are all equal.

However, in many instances, the model is an oversimplification of
reality. We do find that some variation in the steepness parameter does
remain in chosen items. For example, when item discrimination indices
were compared for two separate groups of pupils (over 2000 in each
group) that had been tested with our Cognitive Ability Tests, the
correlations from one group to the other of indices within a sub-test ranged
from 0.88 to 0.93 with an average value of 0.90. Clearly the items were
not all equally saturated with the common attribute that they were
measuring. Furthermore, with multiple-choice and especially with
true-false items, the assumption of a zero asymptote at the low end of the
ability scale is almost certainly incorrect. Though many persons of low
ability will omit an item where they do not know the answer, many others
(at least in the USA) will guess and, unless the item writer has been more
than usually skilful, that guess may be restricted to the two or three more
appealing options. Consequently all examinees are likely to have a
considerable probability of hitting the correct answer.

Thus the one-parameter model provides an approximation that may be
pretty rough in some circumstances. It seems most defensible in the case
of constructed-response items in which the examinee must generate the
response, and for tests composed of items that have survived a rigid
empirical pre-screening. Especially for these, its computational and
conceptual simplicity can make it quite attractive.

The three-parameter model treats as real and significant the differences
in steepness and in asymptote of item characteristic functions, as well as
differences in difficulty. With the resulting large number of parameters to
be estimated for a set of items, really large data samples are called for if
the estimates are to show satisfactory stability from one sample to
another. Thus use of this model in a practical setting makes sense only
for tests in which try-out of the items on large samples (of over 1000, if
one accepts Lord's recommendations) is a practical possibility. Alas, few
of us are so situated. Furthermore, estimation calls for high-speed and
high-capacity computing facilities. When all of these conditions are met
and especially with the widely used multiple-choice format, it seems
likely that the more complex model will provide a better description of
each item, even if that information is only partially used in decisions
related to or based on the set of items.

One assumption basic to both of these latent trait models is that the
parameters of a test item are invariant, depending only upon the
properties of the item and not on the group to which it is administered. If an
item that is easy, relative to other items, in one group is difficult in
another, then any general statement about the difficulty level of the item
per se is meaningless. Uniformity seems most likely to prevail for items
that depend primarily upon level of maturity and upon broad general
background. Problems seem most likely to arise with items that are based
upon specific school instruction, especially when that instruction is likely
to vary widely from one place to another. Thus difficulty of an item
calling for selection of the prime numbers from the set 31, 33, 35, 37, 39 is
likely to be very much less for a group of 10-year-olds who have just
started a unit on prime numbers than for a group who has never received
such focused instruction or has received it in the more remote past. On
the other hand, difficulty of a matrices or a figure analogies problem,
relative to other items of the same type, seems likely to be relatively
stable from group to group. We conclude, then, that attempts to express
a person's status by the level at which that person falls on a vertical trait
dimension are most defensible for ability measures that reflect general
growth and the broad range of common experiences.

How will latent trait models influence our practical procedures of test
development and our interpretation of test scores? How will we proceed
differently from the way that we have in the past when we relied upon
item difficulty and item discrimination indices? These questions will be
addressed in detail by some of the later speakers at this seminar. I will
only hazard a couple of guesses.

For the large bulk of testing, both with locally developed and with
standardized tests, I doubt that there will be a great deal of change. The
items that we will select for a test will not be much different from those
we would have selected with earlier procedures, and the resulting tests
will continue to have much the same properties. The essential feature of a
latent trait model is that test score is interpreted as a scale value on the
vertical scale of the latent trait, rather than being expressed in normative
terms in relation to some reference group of persons. It is more than 50
years now since Edward L. Thorndike and his associates developed the
Intelligence Scale CAVD in an effort to express level of intellectual
performance in an equal-unit vertical scale, but this scale never achieved any
great measure of popular acceptance in the testing community.
Normative reference rather than absolute scaling has always seemed more
meaningful and useful, and I doubt that this will change.

There is, however, one field in which efficient use of item parameters to
estimate the individual's location on a trait dimension is likely to prove
crucially important. This is in individualized testing, whether by human
examiner or by computer. When it becomes important to obtain the
greatest possible amount of information about an examinee in a limited
amount of testing time, then it is vitally important that each test exercise
be one that will yield the maximum amount of information about that
examinee. These are the items on which we start out with the greatest
amount of uncertainty about whether the examinee will get the item
right. They are items for which the item characteristic function is steepest
at the ability level where the examinee is believed to lie. Adaptive testing,
which progressively refines the estimate of an examinee's status after each
item and picks the next test exercise to match the current estimate of that
status, is one field in which the item parameters of single items will play a
key role.
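
The selection rule described in this paragraph can be sketched as follows. This is only an illustration under stated assumptions: the logistic item characteristic function and its standard Fisher information formula are used, and the item pool, its labels, and its parameter values are hypothetical, not data from the paper.

```python
import math

def p_correct(theta, a, b, c):
    """Logistic item characteristic function (a, b, c are assumed names)."""
    return c + (1.0 - c) / (1.0 + math.exp(-a * (theta - b)))

def item_information(theta, a, b, c):
    """Fisher information of a logistic item at theta.

    Reduces to a*a*p*(1-p) when the lower asymptote c is zero.
    """
    p = p_correct(theta, a, b, c)
    return a * a * ((1.0 - p) / p) * ((p - c) / (1.0 - c)) ** 2

def pick_next_item(theta_hat, pool, used):
    """Return the unused item that is most informative at the current estimate."""
    candidates = [name for name in pool if name not in used]
    return max(candidates, key=lambda name: item_information(theta_hat, *pool[name]))

# Hypothetical calibrated pool: name -> (steepness, difficulty, lower asymptote)
pool = {
    "easy": (1.0, -2.0, 0.0),
    "medium": (1.0, 0.0, 0.0),
    "hard": (1.0, 2.0, 0.0),
}
```

For example, `pick_next_item(0.1, pool, set())` selects the item whose difficulty lies closest to the current estimate, which is where uncertainty about the outcome is greatest when steepness is equal across items.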

It will also be true that accumulation of data on item parameters for
large pools of items, to the extent that these parameters are stable from
group to group and from time to time, will make possible great flexibility
in test construction. Such a calibrated item pool will make it quite easy to
generate alternate forms of tests, equivalent in difficulty, that can
provide estimates of performance expressed on a common scale: estimates
that will facilitate studies of growth and change, that will permit testing
different candidates with different but equivalent sets of tasks, and that
may have a range of other practical uses.

The practice of educational and psychological measurement has
evolved gradually over the course of time. Our models gradually become
more explicit, better rationalized, and, we hope, more accurate
reflections of examinee test-taking behaviour and of the abilities that lie at the
back of that behaviour. I have tried to describe to you the present status
of certain models. It is highly appropriate at this point in time that we
take stock of those models and see what they have to offer for our
practice in the years ahead.

Professor Thorndike's knowledge of the traditions of educational
measurement began, I suppose, at his father's knee. The richness of the
subsequent work, and the experience which shaped it, is evident in the
perspective which his review has provided for us. The review sets the
various theoretical and applied traditions in perspective and leads us
neatly to the concluding invitation to take stock of the measurement
models now current. His paper thus sets the context for our seminar
without attempting to pre-empt other speakers with definitive projections
of what might prevail and what might disappear from the contemporary
body of measurement theory and practice.

I find myself encouraged by the analysis in the paper to commence this
discussion with a first attempt at taking stock. The attempt, in classical
theory, to provide a theory of measurement error was identified as a
significant feature of the early developments. I wonder if that issue might
not be helpfully pursued as an important point of comparison of the
models.

Since the error variance could not be estimated directly from
intra-individual variability on multiple measurements with the same
instruments, an indirect means of assessment was required. Classical
theory provided the means for using inter-individual variability over two
occasions as the basis for estimating error variance. It has probably been
unfortunate that the correlation coefficients from which the standard
errors of measurements are calculated have been more dominant in the
language used to describe the properties of tests and to judge the utility
of the measurements they provided.

Speaking generally of a test's reliability can obscure the significance of
the assumption that the test has a particular lack of precision which is
constant, if not for all applications, then at least for all individuals on a
particular application. The possibility of one individual's status not being
assessed as precisely as another's is simply not accommodated.

Generalizability theory, as Professor Thorndike's review makes clear,
offers a refined conception of error of measurement through more
complete specification of the various sources of error. I have found this
approach helpful, particularly in the development and use of observation
schedules; but it needs to be pointed out that, although this approach
allows the estimation of the magnitude of error attributable to different
sources, it still presumes that the errors of estimation for all individuals'
scores are the same.

I am reminded of a claim I heard recently that the most significant
advances in the design of sailing vessels occurred after the introduction of
steamships. I now wonder whether the refinement of error estimation
provided by generalizability theory, within the general framework of
classical theory, cannot be seen similarly as the product of zealous
attention to detail in a dying cause. I wonder where weekend sailors or
designers and owners of 12 metre yachts attempting to win the America's
Cup would force me to push that analogy. Sailing boats have not
disappeared despite their loss of utility.

Latent trait models provide the means of more precise analyses of error
variance. The magnitude of error variance depends on the precision with
which the available items can locate each individual's ability. Those
whose ability is in a range poorly covered by items are less precisely
measured than those for whom there are more items with difficulty levels
close to their level of mastery. A simple analogy can be drawn from a
measuring stick on which the fine divisions have been erased in some
sections.

Given this capacity for more precise specifications of measurement
error for each level of measurement, should we not abandon the earlier
conceptions, albeit with due acknowledgement of their historic
contribution? Professor Thorndike's explication of the historic developments
encourages me to press the case for latent trait models to this extent.

How should we then deal with criterion-referenced measurement? Its 
development was an attempt to free measurement from the circularity of 
norm-referenced measurement but perhaps an ultimately futile one. Pro- 
fessor Thorndike does speak of both domain sampling and criterion 
referencing. I would preserve the distinction between them but now refer 
only to the latter. Items which have been selected from a domain, if they 
vary in difficulty, can be located on a vertical dimension. The predilec- 
tion of criterion-referenced measurers for worrying only about whether 
individuals are above or below some point does not remove the under- 
lying continuum or reduce the value of a fuller view of it. It is a latent 
trait, and more precise location of individuals on it is surely better than 
determining only whether they are in one region or another. Can I claim
encouragement in Professor Thorndike's analysis for asserting that in
criterion-referenced measurement we have another sailing vessel,
perhaps even one designed by a Thor Heyerdahl who would eschew not
only steam for propulsion but even wood for the hull? I will let those at
the ACER who purport to divide the world into literates and illiterates or
numerates and innumerates defend themselves.

In characterizing the differences between the main approaches to latent
trait measurement theory and practice, Professor Thorndike has set out
clearly for us the differences in model complexity and ease of application.
I would like to highlight an important ideological difference which forces
me to address not only questions of utility in judging the alternatives.

The three-parameter model is the natural extension of the earlier
classical theory. It accommodates the baggage of classical theory, item
difficulty and item discrimination, and, with the addition of the third
parameter, even accommodates the US love of the multiple-choice item.

The proponents of the three-parameter model like to see the
one-parameter model as the degenerate version of their own more complex
model, perhaps to be used on days when they are feeling simple-minded.
That, however, inadequately acknowledges the fundamental view of the
developers of the one-parameter model.

They do not include the second parameter, for item discrimination,
because it will accommodate the possibility of individuals of a given
ability having a greater probability of being correct on a harder item than
an easier one, at a point where the item characteristic curves have
crossed. The second-parameter brigade says that items in the world are
like that so their characteristics should be represented. The
one-parameter brigade says that measurement cannot occur with such
instruments so such item behaviour should be identified as failure to fit the
model. The battle is thus joined by proponents of the one-parameter
model on the ideological ground of beliefs about the nature of
measurement.

On the grounds of utility, the battle is joined by supporters of the
three-parameter model in terms of questions such as those Professor
Thorndike outlined. When, for example, is it worth the effort and the
extra data required to extract three parameters?

Proponents of the one-parameter model, however, do not see these as
meaningful questions. Once they have stopped asking why one would
ever be justified in including more than one parameter, they seek to
demonstrate that, with dichotomous responses, there is insufficient
information to estimate more than one parameter. For them, it is only by the
sleight of hand, which arbitrary constraints in computer programs can
conceal, that second and third parameters can be estimated.

Professor Thorndike's overview stimulates me, then, to start the
discussion by asserting that the early classical formulations of reliability
and their extension to generalizability can be dispensed with, that
criterion-referenced measurement as originally introduced led us up a
blind alley, and that the measurement theory debate now is lodged in the
latent trait domain. The battle is joined by one group claiming that the
other's view represents just a special case of its own more complex one.
The others, in their turn, claim that the complexity their opponents
prefer ought not to be sought but, if sought, cannot be quantified.

Comparing Latent Trait with
Classical Measurement Models
in the Practice of Educational and
Psychological Measurement

John A. Keats

In comparing latent trait theory with true score theory one may apply the
criteria of either to the other. This practice will, of course, favour the
theory which emphasizes the criteria being used. For example, some have
criticized the Rasch model by showing empirically that selecting items in
terms of this model will not necessarily produce the most reliable test of a
given size from the item pool available. The result is predictable from the
models as the true score model and associated methods tend to maximize
reliability in some sense whereas the Rasch model requires different
characteristics of the items. Alternatively one can argue that, since the
relationship between true score and underlying ability is not readily
specifiable for most tests, the true score model has defects. Such an
argument is based on the criteria which form the basis of latent trait theory.

In the early 1950s, Gulliksen (1950) produced a synthesis of various
contributions to true score theory and Lazarsfeld (1950), Lord (1952),
and Arbous and Kerrich (1951) produced applications of latent trait
theory to attitude scaling, test theory, and accident proneness
respectively.

Specifically the model described by Arbous and Kerrich is based on the
relation between the proportion, $p(x)$, of subjects having $x$ accidents in a
given time and a hypothesized latent attribute, accident proneness ($\lambda$),
such that:

$$p(x) = \int_0^\infty \frac{e^{-\lambda}\lambda^{x}}{x!}\, dF(\lambda)$$

where $F(\lambda)$ is the cumulative frequency distribution of $\lambda$ in the
population studied. These writers are careful to point out that success in
predicting the values of $p(x)$ only makes the accident proneness assumption
more plausible but there are other possible interpretations.

In his treatment of the normal ogive latent trait ability model, Lord
(1952) assumes that the probability of subjects of ability $\theta$ giving the
correct response to item $g$ may be written as:

$$P_g(\theta) = \int_{-\infty}^{a_g(\theta - b_g)} \phi(t)\, dt$$

where $a_g$ and $b_g$ correspond to parameters of item $g$, $\phi(t)$ is the
standard normal frequency function, and $t$ is a dummy variable which
disappears in integrations.

Lazarsfeld (1950) explored the latent linear model which may be
written as:

$$p_i = a_i + b_i x$$

relating the proportion of subjects giving the correct response to item $i$,
$p_i$, with the item parameters $a_i$ and $b_i$, to the underlying variable $x$.

The true score model has its origins in physical measurement and the
study of errors of measurement. It is assumed that a person's score ($X_i$)
on a test may be regarded as consisting of two additive components: the
true score ($T_i$) and the error score ($E_i$) so that:

$$X_i = T_i + E_i$$


The true score is thought of as a parameter of the person at the time of
testing whereas the error score is thought of as due to minor fluctuations
caused by irrelevant factors. Gulliksen (1950) produced a systematic
account of this model. Ehrenberg (1953) criticized Gulliksen (1950) on the
grounds that this account does not make clear what is being estimated
and what the basis of estimation is. Rasch (1960) takes up these points
and uses them as a justification for his latent trait model.

The questions of definition and estimation of true scores are taken up
by Lord and Novick (1968) but their proposed answers were criticized by
Lumsden (1978) and more systematically by McDonald (1979).† There
appear to be some unresolved questions related to the precise status of
the concept of true score even within the context of the model itself.
Despite these problems, there is a large and growing literature developing
the true score model, particularly addressing the question of the
frequency distribution of true scores. Keats and Lord (1962) presented the
beta-binomial model of the frequency distribution of true scores and raw
scores. This model has been elaborated by Keats (1964; 1965) and Lord
and Novick (1968) and more recently by Huynh (1977) in the context of
mastery scores.

† Personal communication, Newcastle, 1979.

In its simplest form, the beta-binomial model assumes that the
frequency distribution of number correct raw scores, $g(x)$, can be written in
terms of the distribution of true scores, $f(p)$, in the following way:

$$g(x) = \int_0^1 \binom{n}{x} p^{x}(1 - p)^{n-x} f(p)\, dp$$


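If the regression of true score on raw score can be represented as a
polynomial in $x$ then:

$$g(x) = K\,\frac{(-n)_x (a_1)_x (a_2)_x \cdots}{x!\,(b_1)_x (b_2)_x \cdots}$$

where $n$ = the number of items, $a_1, a_2, \ldots$; $b_1, b_2, \ldots$ are
parameters to be estimated and

$$(a_1)_x = \frac{\Gamma(a_1 + x)}{\Gamma(a_1)} = a_1(a_1 + 1)(a_1 + 2)\cdots(a_1 + x - 1), \text{ etc.}$$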

and the constant $K$, $g(0)$, makes the sum of the $g(x)$ values unity. If the
regression of true score on raw score is linear, then:

$$g(x) = K\,\frac{(-n)_x (a)_x}{x!\,(b)_x}$$

and this form can also be obtained by taking $f(p)$ as a beta distribution,
hence the name beta-binomial model. It is worth noting that defining true
score in this way implies that it is a latent trait accounting for differences
in raw score along the lines of the accident proneness trait defined in the
Arbous and Kerrich model. However this is not the usual way of defining
true scores. The latent trait ability models of Lord and later of Rasch
define ability in terms of performance on an item rather than on the total
test.
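
As a rough illustration of the beta-binomial model, the raw score distribution obtained when the true score distribution is taken to be a beta distribution can be computed directly. In this sketch the shape parameters `alpha` and `beta` and the function names are assumptions introduced for illustration; they are not the paper's notation.

```python
import math

def log_beta(a, b):
    """Logarithm of the beta function, computed from log-gamma values."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)

def beta_binomial_pmf(x, n, alpha, beta):
    """Proportion of examinees with raw score x on n items.

    Assumes true proportion-correct scores follow a Beta(alpha, beta)
    distribution and raw scores are binomial given the true score.
    """
    log_choose = math.lgamma(n + 1) - math.lgamma(x + 1) - math.lgamma(n - x + 1)
    return math.exp(log_choose + log_beta(x + alpha, n - x + beta) - log_beta(alpha, beta))

def raw_score_distribution(n, alpha, beta):
    """Return g(0), ..., g(n) as a list."""
    return [beta_binomial_pmf(x, n, alpha, beta) for x in range(n + 1)]
```

The resulting values sum to one, and the mean raw score equals n times the mean of the assumed true score distribution.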

Although there appears to be no general systematic account of the
latent trait model comparable to the treatment by Lord and Novick (1968)
of the true score model, there are many recent developments of the
model in the context of test theory. The main problem tackled is that of
estimation of parameters and significance tests for the applicability of the
model, e.g. Gustafsson (1979).

Most of the developments in the literature on both models are either
statistical in content or deal with problems of application in education
and psychology. There has been no attempt to compare the models in the
more general context of, for example, cognitive development. Rasch's
(1960) claim that his model leads to the estimation of a person
parameter, for example an ability, can only be justified in the context of
adult behaviour in which the ability of a person may be thought of as
approximately constant over a long period of time. When subjects under,
say, 15 years are considered, this 'parameter' as defined and estimated at
a particular time will not be the same a year or even six months later. It is
important to use the Rasch approach to define the dimension measured
by a set of items, but any measure of a person at a given time is only
meaningful if it can be related to the value towards which the person is
developing. It is this asymptotic value that constitutes a parameter of the
person.

A number of writers, Courtis (1933), Bayley (1955), Cattell (1971), and
Jensen (1973), have attempted theoretical formulations of cognitive
development. Courtis proposed a two-parameter individual model which
involved difficult problems of estimation but did not suggest a form for
the group curve. Bayley attempted to construct an empirical group
cognitive growth curve but encountered difficulties in establishing an
appropriate unit of measurement. Anderson (1940), Cattell (1971), and
Jensen (1973) separately proposed random accumulation models, with
Jensen suggesting a separate consolidation parameter which differed
from person to person. These attempts seem to be trying to account for
certain known facts of cognitive development separately, namely:

1 Average ability seems to develop at a negatively accelerating rate 
towards a stable level (Bayley). 

2 Individuals develop at different rates towards different stable adult
values (Courtis and, in part, Jensen).

3 The older the person becomes, and therefore the closer he or she
approaches the adult value, the more stable IQ measures become when
taken at yearly intervals. This fact is reflected in the simplex pattern
obtained when standardized ability measures at a number of successive age
levels are intercorrelated. As pointed out by Anderson (1940), Cattell
(1971), and Jensen (1973), such a simplex can be generated by means of
random accumulations over time. Jensen's suggestion that people differ
in the extent to which they can consolidate these accumulations leads to
growing differences between subjects as they become older but these
differences become more stable.

These three general findings are jointly consistent with a
developmental model with at least two individual differences parameters as well as a
measure of ability on a ratio scale and a measure of time since
development started. However, for the sake of clarity and system, a
developmental model with one individual differences parameter will be examined first
and its strengths and deficiencies noted before proceeding to a
two-parameter model. A further reason for studying the one-parameter
model is that the use of the IQ in one or other of its forms implies a
one-parameter model of development. The possibility of more than two
parameters will also be explored to the extent of showing how data may
be examined to justify more than two parameters. Within the one- and
two-parameter models, the applicability or otherwise of a latent trait
model or a true score model will be examined. It is important to note that
the concern here is to model developmental change, not simply to
measure change in the absence of a model. The question of the
unreliability of change measures is not relevant to this discussion.



THE PRINCIPLE OF DYNAMIC CONSISTENCY

In discussing key problems associated with formal theory in Psychology,
Marx (1976) lists the problem of individual versus group functions and
notes: 'It is now generally agreed that in terms of mathematical functions
representing behaviour processes, data obtained from groups cannot be
freely used for individual function' (p. 255). Neither he nor apparently
anybody else is sufficiently disturbed about this state of affairs to suggest
that it should be a key principle in mathematical models of behaviour
that the form of the relationship between behavioural measures, stimulus
variables, and individual differences variables should be the same at the
group level as it is at the individual level.

If this principle does not apply then the form of relationship obtained
with group data may well vary according to the particular sample of
individual difference parameters present in the group. Under such
circumstances it is hard to see how general laws based on group data can
have any general validity at all. Sometimes the principle can be built into
the model by specifying how the individual data are to be aggregated into
a group function. If this can be done it seems important that it should be
done. Consider the basic equation of the Rasch latent ability model
(Rasch, 1960) relating the proportion ($P_{ij}$) of subjects of ability ($A_i$)
giving the correct response to a particular item, to the difficulty ($D_j$) of the
item:

$$P_{ij} = \frac{A_i}{A_i + D_j}$$

If one considers a test of $n$ items with values of $D_j$, $j = 1, \ldots, n$, not all
equal, the true score value corresponding to ability $A_i$ is $nP_{i\cdot}$, where
$P_{i\cdot}$ is the mean value of $P_{ij}$ averaged over all $n$ items for subjects of
ability $A_i$. The relationship between $P_{i\cdot}$ and $A_i$ will depend on the
distribution of the values of $D_j$ but will not in general be of the form:

$$P_{i\cdot} = \frac{A_i}{A_i + D_{\cdot}}$$

where $D_{\cdot}$ is the arithmetic mean of the values of $D_j$. Thus true score will
never be related to underlying ability as defined by this model in the same
way as the individual items of different difficulty which generate the
score. This conclusion generalizes to all latent ability models using
nonlinear item characteristic curves and unequal item difficulties. In the
case of the Rasch model it may be noted that, if the harmonic mean of
the $P_{ij}$ values, $H(P_{i\cdot})$, were used to define true score,

$$H(P_{i\cdot}) = \frac{A_i}{A_i + D_{\cdot}}$$

which is of the same form as the individual item curves. Of course there
would be problems in defining true score in this way.
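
The contrast between the arithmetic and the harmonic mean of the item probabilities can be checked numerically. The following sketch uses the ratio form of the Rasch model, in which the probability is ability divided by the sum of ability and difficulty; the three difficulty values and the function names are hypothetical and chosen only for illustration.

```python
from statistics import fmean, harmonic_mean

def rasch_p(ability, difficulty):
    """Ratio form of the Rasch model: ability / (ability + difficulty)."""
    return ability / (ability + difficulty)

# Hypothetical item difficulties, deliberately unequal.
difficulties = [0.5, 1.0, 4.0]
mean_difficulty = fmean(difficulties)

def compare(ability):
    """Return (arithmetic mean of P, harmonic mean of P, single-item curve at mean difficulty)."""
    p_values = [rasch_p(ability, d) for d in difficulties]
    return fmean(p_values), harmonic_mean(p_values), rasch_p(ability, mean_difficulty)
```

Calling `compare` at any positive ability shows the harmonic mean agreeing with the single-item curve evaluated at the mean difficulty, while the arithmetic mean, which defines the usual true score, does not.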

The above criticism of the true score model can be objected to on the 
grounds that it arises from the latent trait approach and that one may 
wish to consider true scores and their frequency distribution without 
assuming an underlying ability. However application of the principle of 
dynamic consistency makes it clear that this principle could never be 
satisfied by true scores for any realistic form of item characteristic curve 
unless all items have the same characteristic curve. In this extreme case, 
Birnbaum (1968) has shown that the number correct score is a sufficient 
statistic for estimating ability irrespective of the form of the item 
characteristic curve, providing it is monotonic. 

In what follows, the principle of dynamic consistency will be regarded 
as fundamental so that it is possible to discuss the form of individual and 
group cognitive development curves without obtaining inconsistencies. 

THE ONE-PARAMETER COGNITIVE
DEVELOPMENT MODEL

The purpose of this model is to relate ability $A_{ij}$, measured on a ratio
scale, at time $t_j$ to time $t_j$ as a variable with one individual differences
parameter, $c_i$, and a scaling parameter, $d$, which will be shown to depend on
the units in which ability and time are expressed. For reasons given by
Halford and Keats (1978), $t_j$ will be taken from birth. For the sake of
simplicity as well as other advantages noted below, the form of the
relationship will be assumed to be:

$$A_{ij} = \frac{t_j}{c_i t_j + d} \quad \text{or} \quad 1/A_{ij} = c_i + d/t_j$$
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It follows that as ^ increases IM.j, approaches c\ more and more closely, 
Thus the individual differences parameter, c,, is a measure of the asymp- 
totic value towards which the ability of subject / approaches as he grows 
older. It follows that the units w„ in which 1/c, is expressed are the same 
as those in which A,^, is expressed. 

It also follows that $H(A_{\cdot k})$, the harmonic mean of $A_{ik}$ over individuals
at age $t_k$, can be expressed as:

$$\frac{1}{H(A_{\cdot k})} = \bar{c} + \frac{d}{t_k} \quad\text{or}\quad H(A_{\cdot k}) = \frac{t_k}{\bar{c}\,t_k + d}$$

which is of the same form as the individual curve with parameter $\bar{c}$, the
arithmetic mean of the individual differences parameters for the group
defined. The group curve, defined in terms of the harmonic mean,
satisfies the principle of dynamic consistency.
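
As a hedged numerical illustration of this dynamic consistency, the sketch below evaluates the one-parameter curve A = t/(ct + d) for a few subjects and compares the harmonic mean of their abilities with the same curve evaluated at the mean parameter. The parameter values and function names are hypothetical and chosen only for illustration.

```python
from statistics import harmonic_mean, mean

def ability(t, c, d):
    """One-parameter growth curve: A = t / (c*t + d), with asymptote 1/c."""
    return t / (c * t + d)

# Hypothetical parameters (illustrative only): one c_i per subject, common d.
c_values = [0.008, 0.010, 0.013]
d = 0.1
c_bar = mean(c_values)

def group_ability(t):
    """Harmonic mean of the individual abilities at age t."""
    return harmonic_mean([ability(t, c, d) for c in c_values])

for t in (2, 5, 10, 20):
    print(t, round(group_ability(t), 3), round(ability(t, c_bar, d), 3))

# Age at which the group curve reaches half of its asymptote 1/c_bar.
half_age = d / c_bar
print(half_age, ability(half_age, c_bar, d), 0.5 / c_bar)
```

The two printed columns agree at every age, and the group curve reaches half of its asymptote at the age d divided by the mean of the c values.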

It further follows that the age at which the group curve reaches half
of its asymptotic value, $1/\bar{c}$, can be expressed as:

$$t_k = d/\bar{c} \quad\text{or}\quad d = t_k\,\bar{c}$$

This latter expression indicates that the constant $d$ is expressed in units
of $u_t u_A^{-1}$ in dimensional analysis terms. Thus $d$ is a constant which varies
according to the units in which time and ability are measured.

The cognitive development model can be related to other measures of
ability in common use. For example, the mental age measure used by
Binet (1908) can be obtained by noting that:

$$t_k = \frac{d\,H(A_{\cdot k})}{1 - \bar{c}\,H(A_{\cdot k})}$$

For a subject with ability $A_{ik}$ at age $t_k$, mental age may be obtained
from the formula:

$$\text{mental age}_{ik} = \frac{d\,A_{ik}}{1 - \bar{c}\,A_{ik}}, \quad\text{provided } A_{ik} < \frac{1}{\bar{c}}$$

By substituting for the value of $A_{ik}$ one obtains:

$$\text{mental age}_{ik} = \frac{d\,t_k}{(c_i - \bar{c})\,t_k + d}$$

Thus if ratio IQ is defined as:

$$\mathrm{IQ}_R = 100 \cdot \frac{\text{mental age}}{\text{chronological age}}$$

then:

$$\mathrm{IQ}_R = \frac{100\,d}{(c_i - \bar{c})\,t_k + d}$$



$\mathrm{IQ}_R$ is a dimensionless quantity. For subjects with asymptotic value $1/c_i$
equal to that of the group curve in terms of which mental age is defined,
$\mathrm{IQ}_R = 100$ and is the same at all ages. For subjects whose asymptotes
are greater than $1/\bar{c}$, the ratio IQ will increase with age until it is
undefined, i.e. $A_{ik} > 1/\bar{c}$. Similarly, for subjects whose asymptotes are less
than $1/\bar{c}$, the ratio IQ will decrease with age. In general

Two obvious criticisms of the ratio IQ arise from this formulation. 
The first has been known for many years (see, for example, Thurstone,
1926) and refers to the fact that mental age and therefore ratio IQ are 
undefined for subjects whose ability has reached or surpassed the asymp- 
totic value of the group curve. In practice this difficulty has been over- 
come by ad hoc methods (see, for example, Terman and Merrill, 1937) 
but these simply confirm the breakdown of the model. 
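
Both criticisms can be illustrated with a short sketch. It assumes the one-parameter curve A = t/(ct + d) and reads mental age off the group curve, as in the formulas above; all parameter values, the arbitrary age units, and the function names are hypothetical.

```python
def ability(t, c, d):
    """One-parameter growth curve: A = t / (c*t + d)."""
    return t / (c * t + d)

def mental_age(a, c_bar, d):
    """Age at which the group curve equals ability a; None once a >= 1/c_bar."""
    if c_bar * a >= 1:
        return None
    return d * a / (1 - c_bar * a)

def ratio_iq(t, c_i, c_bar, d):
    """Ratio IQ = 100 * mental age / chronological age (None if undefined)."""
    ma = mental_age(ability(t, c_i, d), c_bar, d)
    return None if ma is None else 100 * ma / t

C_BAR, D = 0.010, 0.1  # hypothetical group parameters, arbitrary units
for c_i in (0.009, 0.010, 0.011):
    print(c_i, [ratio_iq(t, c_i, C_BAR, D) for t in (5, 10, 15, 200)])
```

In this sketch a subject with the group asymptote keeps a ratio IQ of 100, a subject with a higher asymptote (smaller c) shows a rising ratio IQ that eventually becomes undefined, and a subject with a lower asymptote shows a falling ratio IQ.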

The second criticism does not appear to have been noted before. This 
criticism relates to the significant trend in ratio IQ with age for a subject 
whose asymptotic value departs from that of the group curve. The ques- 
tion of whether or not such trends occur can be answered in the 
affirmative from data published by Skodak and Skeels (1949, Table IX, 
p. 146). These data are for two groups of subjects whose natural
mothers' IQs averaged 63 (Group A) and 109 (Group B). It would be
expected that individual data would show considerable variability and this
is so. However the equation

$$\frac{100}{\mathrm{IQ}_R} = \frac{(c_i - \bar{c})\,t_k}{d} + 1$$

can be averaged over two subgroups to obtain:

$$\frac{100}{H(\mathrm{IQ}_{RA})} = \frac{(\bar{c}_A - \bar{c})\,t_k}{d} + 1$$

and

$$\frac{100}{H(\mathrm{IQ}_{RB})} = \frac{(\bar{c}_B - \bar{c})\,t_k}{d} + 1$$

Figure 1 Changes in Ratio IQ with Age in Two Groups of Subjects (horizontal axis: age in months)

The corresponding graphs for group A and group B are given in Figure
1 and show clearly that for group A the slope is positive, which implies
that $\bar{c}_A > \bar{c}$ and so the asymptotic value for this group is less than the
average asymptote, as would be expected from the average IQ of the
natural mothers. On the other hand, for group B the slope is negative,
which implies that the asymptotic value for group B is greater than
average, as would also be expected. Thus these data seem to confirm some
of the predictions of the one-parameter model and imply that the single
parameter, $c_i$, is at least in part influenced by heredity. On the other
hand, the intercept on the y axis should be unity according to the model
but this is far from the case. In fact, if individual graphs are drawn and
fitted by straight lines in the usual way, almost all of the 19 graphs would
have intercepts less than one. Thus the single parameter model is only
partially confirmed. This point will be taken up again when the
two-parameter model is discussed.

While the adherents to the true score approach were severely critical of 
the ratio IQ, they were appreciative of the fact that some transformation
of number-correct score was necessary in most applications. In many 
cases the transformation recommended was that of standardizing to a 
fixed mean and standard deviation. Similar transformations of ability 
measures have been proposed so that individual values can be related to 
particular groups. 

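Because $(A_{ik})^{-1}$ is related to $c_i$ and $1/t_k$ in a linear fashion, it seems
reasonable to define a deviation score or $\mathrm{IQ}_D$ in terms of $(A_{ik})^{-1}$. It may
be noted that the mean of $(A_{ik})^{-1}$ at age $t_k$ is $[H(A_{\cdot k})]^{-1}$ and so

$$[H(A_{\cdot k})]^{-1} - (A_{ik})^{-1} = \bar{c} - c_i$$

and

$$\frac{[H(A_{\cdot k})]^{-1} - (A_{ik})^{-1}}{\sigma_{A^{-1}}}\,(15) + 100 = \frac{\bar{c} - c_i}{\sigma_c}\,(15) + 100 = \mathrm{IQ}_D .$$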

Thus the standard score of the reciprocal of ability at any age level is 
equal to the asymptotic parameter expressed as a standard score with the 
same mean and standard deviation. 
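
This equality of standard scores can be checked with a small sketch. It assumes the one-parameter relation 1/A = c + d/t, so that the spread of reciprocal ability at any age equals the spread of the c values; the parameter values and function names are hypothetical, and the scores are left on the reciprocal scale without any reversal of sign.

```python
from statistics import mean, pstdev

def reciprocal_ability(t, c, d):
    """1/A = c + d/t under the one-parameter model."""
    return c + d / t

# Hypothetical individual parameters c_i and common scaling constant d.
c_values = [0.008, 0.009, 0.010, 0.012, 0.013]
d = 0.1

def standard_scores(values, new_mean=100, new_sd=15):
    """Rescale values to the given mean and standard deviation."""
    m, s = mean(values), pstdev(values)
    return [new_mean + new_sd * (v - m) / s for v in values]

scores_c = standard_scores(c_values)
for t in (5, 10, 15):
    recip = [reciprocal_ability(t, c, d) for c in c_values]
    print(t, [round(s, 2) for s in standard_scores(recip)])
print("c:", [round(s, 2) for s in scores_c])
```

The rows printed for the different ages are identical to the row for the c values, which is the age independence the text describes.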

It follows that $\mathrm{IQ}_D$ and $\mathrm{IQ}_R$ can be related in terms of the model and
the relationship can be simplified by choosing a scale for the ability
variable such that $\sigma_c = 15$. With this convention

$$\mathrm{IQ}_D = \bar{c} - c_i + 100$$

and

$$\frac{100}{\mathrm{IQ}_R} = \frac{(100 - \mathrm{IQ}_D)\,t_k}{d} + 1$$

While this relationship can be defined in terms of the cognitive
development model proposed above and the corresponding definitions of $\mathrm{IQ}_R$
and $\mathrm{IQ}_D$, it is of interest to see to what extent this relationship holds for
conventional ability tests. It is possible to estimate $\mathrm{IQ}_D$ and $\mathrm{IQ}_R$ for tests
for which norms are available for a number of age levels. For example,
ACER Test, Intermediate D (1947-51), has norms for ages 10-14 years.
According to these norms, raw scores corresponding to an $\mathrm{IQ}_D$ of 100 at each
of the ages 10.0, 11.0, 12.0, 13.0, and 14.0 are 14, 23, 31, 39, and 46
respectively. Thus 10-year-olds obtaining these scores would have mental
ages of 10, 11, 12, 13, and 14 years respectively and ratio IQs of 100, 110,
120, 130, and 140 but their deviation IQs would be 100, 110, 117, 123,
and 129. Similar corresponding values could be obtained for
14-year-olds. With these data, $100/\mathrm{IQ}_R$ is plotted against $(\mathrm{IQ}_D - 100)\,t_k$ and the
graph should be linear, passing through the point (0, 1). Figure 2 displays
the graph obtained for the nine points available from these data. It
would appear that the simplifications and approximations used in the
model have led to a prediction which is not inconsistent with these data.
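
The check described here can be sketched for the five points quoted in the text for 10-year-olds (the 14-year-old values are not given, so they are omitted). The least-squares helper is an illustrative assumption about how a straight line might be fitted and is not the author's procedure.

```python
# Values quoted in the text for 10-year-olds on ACER Intermediate D.
age = 10.0
ratio_iqs = [100, 110, 120, 130, 140]
deviation_iqs = [100, 110, 117, 123, 129]

# Plot coordinates: 100/IQ_R against (IQ_D - 100) * t_k.
xs = [(iq_d - 100) * age for iq_d in deviation_iqs]
ys = [100 / iq_r for iq_r in ratio_iqs]

def least_squares(xs, ys):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, my - slope * mx

slope, intercept = least_squares(xs, ys)
print(f"slope={slope:.6f}, intercept={intercept:.4f}")
```

For these five points the fitted intercept lies close to unity, which is what a line through (0, 1) requires.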

Although deviation IQs are never defined in this way, the actual values
obtained would not differ greatly from the definition above.

If deviation IQ as usually defined were constant, the value obtained
would obviously be a good predictor of adult performance. Because of
unreliability in measurement, it is not sufficient to show instability in $\mathrm{IQ}_D$
to challenge the single parameter approach. Even systematic trend in $\mathrm{IQ}_D$
over age in certain subjects would not be sufficient. What needs to be
shown is that systematic departures from constancy of deviation IQ are
related to other variables.

Two relevant studies on this topic are those of McCall et al. (1973) and
Hindley and Owen (1979). The former analysed longitudinal data
expressed as deviation IQs for a large group of subjects. Between-subjects
differences corrected for mean value were used to define clusters of
subjects with similar patterns of change. The largest group (approximately
40 per cent) showed no systematic change over the 10 years studied. A
second group revealed an increasing trend in $\mathrm{IQ}_D$ up to approximately 10
years of age followed by a decrease to the original value. At all age levels,
the average $\mathrm{IQ}_D$ values were substantially greater than 100 for this group.
Interviews with parents of all subjects were conducted focusing on child
raising practices. Parents of this second group were typically
'accelerators', that is they attempted to stimulate the cognitive development
of their children beyond or ahead of what was done at school. Other
groups of children showed decreasing trends in $\mathrm{IQ}_D$ and parental practice
in these cases tended to be either repressive or laissez-faire. The essential
point is that systematic changes in $\mathrm{IQ}_D$ were associated with different
kinds of child raising practices.

Figure 2 Relationship between Deviation and Ratio IQ on the ACER
Intermediate D Test of Intelligence

In the Hindley and Owen study (1979), similar data were analysed
individually. For each subject significant departures from constancy, linear
trend, quadratic trend, etc. were tested for significance sequentially and
parameters for linear, quadratic, etc. orthogonal functions estimated
where appropriate. Parameters were averaged across subjects for each of
three social classes and systematic differences noted. In particular,
children with upper-class parents tended to show the same trend in IQ as
those in the McCall et al. study with accelerating parents. The Hindley
and Owen study is an important confirmation of the McCall study and
both show that, for some subjects, $\mathrm{IQ}_D$ is variable and the variation is
correlated with systematic environmental factors. Thus the
one-parameter cognitive growth model, which could define a deviation IQ
which is constant with age, does not account for at least some of the
phenomena reliably observed.

The present model may be compared with the consolidation model
proposed by Jensen (1973). This model accounts for the simplex
structure apparent in correlation tables from repeated measures obtained in
longitudinal studies. Jensen's version is explicated in some mathematical
detail and is based on the notion of random cumulating increments over
time. However the model does not yield an asymptotic growth curve and
no way has been suggested for estimating the consolidating factor, $F$,
from test data even though this is intended as the basic parameter
determining adult level of performance.

According to Jensen's model, $S_n$, the performance of a particular
subject after $n$ time intervals, is given by:

$$S_n = F_i\,(C_1 + C_2 + \cdots + C_{n-1}) + C_n$$

with $C_k$ the random component from experience in some time interval; $F_i$
the proportion of cumulative experience which is consolidated; and $C_n$
the random component which is the result of current experience,
unconsolidated.

If is the mean of the distribution from which the random
components $C$ are drawn, $C_i$ the mean of the actual values for a particular
individual, $i$, and $t_k$ is the time measure corresponding to the intervals, then

where and are expressed as experience per unit of time. Thus $S_n$ will
increase approximately linearly with time at a rate dependent on $F_i$, the
amount of consolidation.

As $t_k$ increases, the proportion of random variation contributed by the
second and third terms of this equation will decrease. Under these
circumstances the correlation between $S_n$ and $S_{n-1}$ will increase with $n$, as has
been frequently observed in actual data in the form of deviation IQs.

In the case of the one-parameter model, it has been shown that
$\mathrm{Var}(1/A_{ik}) = \mathrm{Var}(c_i)$ and is independent of age. Thus the correlation between
values of $(1/A_{ik})$ at constant time intervals will not vary with age. The
single parameter model does not predict the simplex structure.

REVIEW OF PROPERTIES OF THE 
ONE-PARAMETER GROWTH CURVE MODEL 
1 This model does represent a growth curve approaching an
asymptote.

2 Individual differences are reflected in the model in differences in the
asymptote approached.

3 The model exhibits dynamic consistency when the group curve is
defined in terms of the harmonic mean.

4 The model correctly predicts variation in ratio IQ for subjects with
above average asymptote as well as for those with below average
asymptote.

5 With regard to the above relationships, the model predicts an
intercept of unity when $100/\mathrm{IQ}_R$ is plotted against time. The value observed
for almost all of the 19 subjects studied was less than unity.

6 The model also predicts that a deviation IQ can be defined which
will be constant across age levels and this does not agree with studies that
show that environmental factors correlate with systematic variations in
$\mathrm{IQ}_D$.

7 The simplex pattern which led Jensen (1973) to propose his
consolidation model would not be explained by the one-parameter model.

THE TWO-PARAMETER COGNITIVE 
DEVELOPMENT MODEL 
The need for the introduction of a second individual differences
parameter arises from the fact that not only do individuals differ in their
ultimate adult level of ability but they also differ in the rate of
approaching that level. It has been noted that $d/\bar{c}$ is equal to the age at
which the group curve defined in terms of the harmonic mean reaches
half of its asymptotic value, $1/\bar{c}$. Thus the parameter $d$ is associated with
the rate of development. In the one-parameter model it is implied that
the larger the asymptotic value, $1/c_i$, the older the person will be when he
or she reaches half of this value since this age equals $d/c_i$. However, if
the parameter $d$ is allowed to vary across individuals, allowance will have
been made for the observed fact that individuals do differ in their rate of
development. Thus the two-parameter cognitive growth model may be
written as:

$$A_{ik} = \frac{t_k}{c_i t_k + d_i} \quad\text{or}\quad \frac{1}{A_{ik}} = c_i + \frac{d_i}{t_k}$$

and

$$\frac{1}{H(A_{\cdot k})} = \bar{c} + \frac{\bar{d}}{t_k} \quad\text{or}\quad H(A_{\cdot k}) = \frac{t_k}{\bar{c}\,t_k + \bar{d}}$$

and dynamic consistency is still preserved. It may readily be observed
that the individual asymptotic value will equal $1/c_i$ and that half of this
value will be reached at an age of $d_i/c_i$. The units of $c_i$ and $d_i$ may be
shown to be $u_A^{-1}$ and $u_A^{-1}u_t$ so that the units of $d_i/c_i$ will be $u_t$ as required.

By the same methods as those used in the one-parameter model, it may
be shown that

$$\mathrm{IQ}_R = \frac{100\,\bar{d}}{(c_i - \bar{c})\,t_k + d_i}$$

or

$$\frac{100}{\mathrm{IQ}_R} = \frac{(c_i - \bar{c})\,t_k}{\bar{d}} + \frac{d_i}{\bar{d}}$$

Thus when $100/\mathrm{IQ}_R$ is plotted against $t_k$ the intercept will no longer be
unity unless $d_i = \bar{d}$. In the case of the Skodak and Skeels data, it was
observed that the intercept was considerably less than unity for almost all
subjects, which implies that the rate of development for these children
was considerably greater than average since $d_i < \bar{d}$.
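
The effect of the second parameter on the intercept can be illustrated with a sketch. It assumes the two-parameter curve A = t/(ct + d) for the individual and the same form with the group means for the group curve; the numerical values and function names are hypothetical.

```python
def ability(t, c, d):
    """Two-parameter growth curve for one subject: A = t / (c*t + d)."""
    return t / (c * t + d)

def ratio_iq(t, c_i, d_i, c_bar, d_bar):
    """Ratio IQ with mental age read off the group curve (None if undefined)."""
    a = ability(t, c_i, d_i)
    if c_bar * a >= 1:
        return None
    mental_age = d_bar * a / (1 - c_bar * a)
    return 100 * mental_age / t

C_BAR, D_BAR = 0.010, 0.10   # hypothetical group means
c_i, d_i = 0.011, 0.08       # hypothetical subject with d_i < d_bar

for t in (2, 5, 10):
    iq = ratio_iq(t, c_i, d_i, C_BAR, D_BAR)
    predicted = (c_i - C_BAR) * t / D_BAR + d_i / D_BAR
    print(t, round(100 / iq, 4), round(predicted, 4))

# Age at which this subject reaches half of the asymptote 1/c_i.
print(d_i / c_i, ability(d_i / c_i, c_i, d_i), 0.5 / c_i)
```

Here the line for 100/IQ against age has intercept d_i divided by the group mean of d, which is 0.8 for these illustrative values and therefore below unity.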

The data from this study have been questioned (see, for example,
Munsinger, 1975) because of the high ratio IQs reported for most of the
children. However, as might be expected on other grounds and is in fact
confirmed in the Skodak and Skeels report, the home environments into
which these children were adopted were considerably above average in
terms of the cognitive stimulation provided for the children at a young
age. Such enriched environments lead to a small value for the parameter
$d_i$ and a high value of ratio IQ which would tend to overstate the adult
value.

When the two-parameter model is written in the form

$$\frac{1}{A_{ik}} = c_i + \frac{d_i}{t_k}$$

it is clear that the influence of the $d_i$ parameter decreases with age.
However the $d_i$ parameter is the one associated with rate of growth, which
can readily be influenced by environmental factors, particularly at young
age levels. Thus the correlation between annual ability measures at young
age levels will be influenced by differences in $d_i$ values as well as $c_i$ values
and so will tend to be lower than the corresponding correlation at older
age levels, which will depend increasingly on $c_i$.

As noted earlier, such correlations produce the characteristic simplex
pattern of correlations observed in many studies and explained by some
writers, e.g. Jensen (1973), in terms of random accumulation from
experience and consolidation of these experiences. While such an
explanation is possible, it is also clear that a process of environmental differences
affecting rate of growth towards an asymptote which is influenced by a
heredity component can also produce the observed phenomenon.
However the current model can also account for the patterns of change
in deviation IQ noted by McCall et al. (1973) and Hindley and Owen
(1979) and associated with environmental differences.

AN INTEGRATION OF TRAIT THEORY,
TRUE SCORE THEORY, AND THE
COGNITIVE GROWTH MODEL

From the principle of dynamic consistency, it is clear that true scores can 
only be explicitly related to ability values when the condition that items 
have identical item characteristic curves is satisfied. This is an additional 
restriction to that imposed by the Rasch model and may not always be 
achievable in practice. However the theoretical and practical advantages 
of tests with equivalent items when prepared for a number of different 
age levels are so great that the additional initial effort may be worthwhile 
at least in some areas. Some of these advantages were noted by Keats
(1967). 

The first important property of tests with equivalent items is that the 
number-correct raw score (x) is a sufficient statistic irrespective of the 
form of the item characteristic curve (Birnbaum, 1968, p. 429). This is 
obviously an important property which enables one to classify subjects in 
terms of raw score without particular assumption of the form of the item 
characteristic curve. 

The second advantage of tests with equivalent items is that the
conditions for the binomial error model are met. Thus the regression,
$\mathrm{Mean}(p \mid x)$, of true score on raw score may be written in terms of the frequency
distribution of $x$, $g(x)$, as follows:

$$\frac{g(x+1)}{g(x)} = \frac{(n - x)}{(x + 1)} \cdot \frac{\mathrm{Mean}(p \mid x)}{1 - \mathrm{Mean}(p \mid x+1)}$$

(Keats and Lord, 1962). Keats (1964) set out procedures based on this
formula and the theory of orthogonal polynomials for testing the
regression of $p$ on $x$ for significant linear, quadratic, cubic, etc. components.
The resulting simple or generalized hypergeometric distribution of $x$ may
be written as:

$$g(x) = K\,\frac{(-n)_x\,(a_1)_x\,(a_2)_x \cdots}{x!\,(b_1)_x\,(b_2)_x \cdots}$$

where

$$(a_i)_x = a_i\,(a_i + 1) \cdots (a_i + x - 1) = \frac{\Gamma(a_i + x)}{\Gamma(a_i)}$$

and $K = g(0)$.

In the special case in which only the linear component contributes
significantly to the regression one may write:

$$\mathrm{Mean}(p \mid x) = r\,\frac{x}{n} + (1 - r)\,\frac{\bar{x}}{n}$$

where $r$ is the Kuder-Richardson formula 21. Various bivariate
distributions can also be specified as shown by Keats and Lord (1962) and Keats
(1964).

If the regression of true score on raw score is to be constant
irrespective of the age sample taken, then

$$\frac{r}{n} \quad\text{and}\quad \frac{\bar{x}}{n}\,(1 - r)$$

must be the same for each population. This condition will be met if:

$$\mathrm{Var}(x) = \frac{\bar{x}\,(n - \bar{x})}{n - r\,(n - 1)}$$

where $\bar{x}$ and $\mathrm{Var}(x)$ relate to the same population and $r$ is the constant
value taken by KR 21 at each age level. If this condition is met, then
cognitive growth could be measured in terms of true score. Failure to
meet this condition would imply that the regression of true score on raw
score is not linear for this particular test and age range.
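
The variance condition can be related to KR 21 directly. The sketch below uses the standard Kuder-Richardson formula 21 and inverts it for the variance; the test length, reliability, and group means are hypothetical values chosen for illustration, and the function names are not from the text.

```python
def kr21(n, mean_x, var_x):
    """Kuder-Richardson formula 21 from test length, raw-score mean and variance."""
    return (n / (n - 1)) * (1 - mean_x * (n - mean_x) / (n * var_x))

def variance_for_constant_r(n, mean_x, r):
    """Raw-score variance that makes KR 21 equal r: mean(n - mean) / (n - r(n - 1))."""
    return mean_x * (n - mean_x) / (n - r * (n - 1))

def true_score_regression(x, n, mean_x, r):
    """Linear regression of proportion-correct true score on raw score x."""
    return r * x / n + (1 - r) * mean_x / n

n, r = 40, 0.9                      # hypothetical test length and KR 21 value
for mean_x in (14.0, 23.0, 31.0):   # hypothetical age-group means
    var_x = variance_for_constant_r(n, mean_x, r)
    print(mean_x, round(var_x, 2), round(kr21(n, mean_x, var_x), 3))
```

Each age group's variance is chosen so that KR 21 returns the same value of r, which is the condition stated above.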

Another advantage of tests with equivalent items is that, as Birnbaum
(1968, p. 458) notes, the maximum likelihood estimate of ability, $\hat{\theta}$, may
be written explicitly as:

$$\hat{\theta} = l + m \log \frac{x}{n - x}$$

if the logistic model is used. By appropriate choice of units and taking
$\hat{\theta} = \log(A/D)$ one has:

$$A = \frac{D\,x/n}{1 - x/n} \quad\text{or}\quad \frac{x}{n} = \frac{A}{A + D}$$

where D is the common difficulty parameter of the items. Thus raw score 
has the same relationship to ability as do the individual items of which it 
is composed. If in addition true score has a linear relation to raw score, 
then true score may be related explicitly to ability. In any case the 
distribution of estimated ability values may be obtained from the 
distribution of raw scores, g(x), even though the distribution of true 
scores is unknown. As many recent articles have indicated, there are 
difficulties in specifying the distribution of true scores if significant 
departures from the beta function are indicated in the data. One advan- 
tage of working with ability values rather than true scores appears clear 
from this analysis. 
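
The step from a raw-score distribution to a distribution of estimated abilities can be sketched as follows, assuming equivalent items with common difficulty D and the relation A = D x/(n - x) given above. The frequency table and the names used are hypothetical, and zero or perfect scores are left without a finite estimate.

```python
def ability_from_raw_score(x, n, difficulty):
    """Ability estimate A = D * x / (n - x); None for zero or perfect scores."""
    if x <= 0 or x >= n:
        return None
    return difficulty * x / (n - x)

def expected_proportion(a, difficulty):
    """Expected proportion correct x/n = A / (A + D)."""
    return a / (a + difficulty)

n, D = 20, 2.0
raw_score_counts = {5: 3, 10: 6, 15: 4, 20: 1}   # hypothetical g(x)

# Carry each raw-score frequency over to the corresponding ability estimate.
ability_distribution = [
    (ability_from_raw_score(x, n, D), count)
    for x, count in sorted(raw_score_counts.items())
]
print(ability_distribution)
```

Because the mapping from x to the ability estimate is one to one for 0 < x < n, the frequencies carry over unchanged.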






According to the two-parameter cognitive growth model, ability $A$ is
projective on time ($t$) with parameters $(1, 0; c_i, d_i)$ and, as first noted for
the case of equivalent items and the logistic assumption, raw score is
projective on ability with parameters $(n, 0; 1, D)$. Thus raw score is
projective on time with parameters $(n, 0; 1 + c_i D, d_i D)$; that is,

$$\frac{x}{n} = \frac{t_k}{(1 + c_i D)\,t_k + d_i D}$$

If two such tests of $n$ items with difficulty parameters $D_1$ and $D_2$ are
administered to the same subjects at approximately the same time, then
the scores on one ($x_1$) may be chained to the scores on the other ($x_2$) by
the formula:

$$\frac{1}{x_2} = \frac{D_2}{D_1} \cdot \frac{1}{x_1} + \frac{1}{n}\left(1 - \frac{D_2}{D_1}\right)$$

i.e. reciprocals of raw scores should be linearly equated.
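
A linear relation between reciprocals of raw scores follows from the equivalent-item relation x/n = A/(A + D) by eliminating A between the two tests. The sketch below implements that derived relation; it is an illustration under the stated model rather than a transcription of the printed formula, and the difficulty values and function names are hypothetical.

```python
def raw_score_from_ability(a, n, difficulty):
    """Expected raw score on n equivalent items: x = n * A / (A + D)."""
    return n * a / (a + difficulty)

def chain_scores(x1, n, d1, d2):
    """Map score x1 (items of difficulty d1) onto the test with difficulty d2.

    Uses 1/x2 = (d2/d1) * (1/x1) + (1 - d2/d1) / n, which follows from
    n/x = 1 + D/A applied to both tests at the same ability A.
    """
    ratio = d2 / d1
    return 1.0 / (ratio / x1 + (1.0 - ratio) / n)

n, d1, d2 = 30, 1.5, 3.0   # hypothetical test length and difficulties
for a in (0.5, 1.0, 4.0):
    x1 = raw_score_from_ability(a, n, d1)
    x2 = raw_score_from_ability(a, n, d2)
    print(round(x1, 3), round(chain_scores(x1, n, d1, d2), 3), round(x2, 3))
```

The chained score computed from x1 alone reproduces the score expected on the second test at the same ability.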

MORE GENERAL COGNITIVE DEVELOPMENT MODELS 
The models proposed so far are based on the assumption that $1/A_{ik}$ is
linear on $1/t_k$, which leads to the projective relationship of ability on
time. For various reasons the observed data may depart significantly
from linearity. One obvious departure would occur if environmental
factors changed to produce a dramatic change in the value of $d_i$. If $d_i$
remained stable at the new value, the graph would consist of two or more
line segments and could be analysed as such. More gradual and persistent
changes in $d_i$, however, would produce significant curvature so that:

/I.* ft ^ 

might be a possible representation. 

Such significant departures from linearity would make the task of
predicting and estimating the asymptotic value extremely difficult. To
what extent they occur can only be discovered by studies which
investigate the need for a three (or more) parameter model of cognitive
development. The data for a decision on this question are not available.

CONCLUSION
The present paper has attempted to review the usefulness of latent trait
models as opposed to true score models in the more general context of
cognitive development. It is clear that, even if only the difficulty of items
in a test is allowed to vary, the true score model has difficulty in
providing a useful representation of cognitive growth whereas the latent trait
model can be readily extended to cover the cognitive growth phenomena.
This is true even if cognitive development is much more irregular and
idiosyncratic than evidence so far suggests. The evidence available at
present is consistent with the notion that at least two parameters are
required to represent cognitive development and these parameters could be
estimated using two administrations of chained tests of the same ability
at widely separated age levels. The most efficient method of estimation
has not been explored here.

Although the latent trait approach is clearly superior in this context, it
seems unfortunate that some of the advantages of the true score model,
for example distribution models, have to be abandoned. Keats (1967)
and Birnbaum (1968) noted some of the advantages of equivalent item
tests. A further advantage noted here arises from the fact that such tests,
and only these, produce a possible reconciliation of the two models.

REACTANT STATEMENT

Kevin F. Collis

In speaking to his paper, Professor Keats has really removed any neces- 
sity for me to react to the specifics in his paper. He has freed me to speak 
about the general notions underlying the differences between the two 
models, true score and latent trait, especially in relation to my own major 
research interest, cognitive development. It should be noted at the outset 
that my interest in the models arose from practical problems, initially in 
the classroom situation, and has never been directed towards the niceties 
of the mathematics involved. 

Like most of us here, I was brought up on the classical true score
model, $X_i = T_i + E_i$. However I very quickly began to find it not totally
satisfactory for practice in the classroom. This was probably because, as
is pointed out in Keats's paper, one was never quite sure of what was
being estimated nor of what the basis of estimation was. Quite apart
from this, the use of the model led to several undesirable practices, two
of which I shall mention briefly.

First, teachers tended to use the various tests based on this model, for
example intelligence tests, to 'label' and categorize individual children.
Once the label had been attached, it was almost impossible for the child
to remove it. Most often, and even more sadly, the child and its parents
accepted the decision. In classroom practice, once the child had been
categorized as of average IQ (say) then many things followed. The
individual could be ignored: expectations were determined and a ready
excuse was available for any one of a number of quite separate school or
home-based problems.

The second undesirable practice which had arisen, partly because of
the classical model, was what I like to call the 'carelessness' syndrome.
Teachers' classroom tests were, often unconsciously, based on the true
score model. They would set a series of items to which the child was to
respond, the various correct scores were added, and the total score
represented the child's achievement level in the content area concerned.
Apart from incorporating many dubious assumptions, this form of
assessment focused the teacher's attention on the number correct and on
ranking the children in order rather than on why an individual child did
not succeed on a particular item. In mathematics teaching, failure to
succeed on an item was often put down to carelessness, especially if the
overall total score was deemed satisfactory. However, in my experience,
children were very rarely careless; they seemed, in the main, to take an
inordinate amount of trouble to try to follow the model solutions
provided by the teacher.

It was these undesirable and clearly unfair practices which led
me to an interest in what lay beneath the surface. Unwittingly, at first,
and then more and more consciously I became involved in what is known
in the jargon as latent trait analysis. Let us look at a couple of particular
examples from elementary mathematics.
Children can be asked to find 'Δ' in each of the following statements:

(i) 3 + 4 = Δ + 3
(ii) 7 - 4 = Δ - 7

Each contains the same number of elements and operations; each uses
small numbers. Why then should the first be easily attainable by early
primary school children and the second not readily achieved until late
primary/early secondary school?

The most productive method for finding out seemed to be to talk with
the children at various age levels as they attempted to solve the problems.
Two strategies for solving the first were in evidence with the younger
children, neither of which was a satisfactory strategy for the second.
The strategies used were either a low-level pattern seeking, 'There's no
"4" on that side', or some form of elementary 'counting on', that is, '3
and 4 are 7, so we need "4" so that 4 and 3 makes 7'. Questioning
revealed that the obvious solution of 3 + 4 = 7 followed by 7 - 3 = 4
not only does not occur to these children but will be denied as an
appropriate method for solving the problem. Clearly the children needed to
be able at least to admit the usefulness of this last strategy if they were to
succeed on the second problem. In reality they needed to do more than
that. Having obtained '3' from '7 - 4' they needed to have a sufficient
overall view of the problem to add when subtraction was so fresh in their
minds and so strongly suggested in the question.

This increasing ability to solve problems involving more and more
complex manipulations of the data could be linked, intuitively at first,
to the cognitive growth phenomena which the Piagetians and
neo-Piagetians were describing in the literature. Further investigations
(Collis, 1975) enabled logical links to be made between the data and cognitive
development models.

As it turned out, when the items which had been devised were analysed
using the latent trait model (ACER, 1977) the earlier intuitions were
confirmed. Perhaps more significant, in the context of the present paper,
psychometricians such as Keats became intrigued by the results coming
out of a number of cognitive development studies (Keats, Collis, and
Halford, 1978) and began to seek a suitable mathematical model to
analyse the data in a more objective manner and to reconcile any new
model thus devised with the classical model. As Keats's paper shows, a
good start has been made in both these areas.

In conclusion, then, my experience suggests that both models have
their uses in practice; the latent-trait model set out in Keats's paper is
particularly valuable for investigating the developmental concerns
currently surfacing in educational and psychological practice. Both models
can be abused by their users, especially those who do not fully
understand the assumptions underlying the particular model. The classical
model has been around a long time and so there is much more evidence
available of its abuse. I see this seminar as serving two important
purposes: one, developing a basic understanding of the underlying
assumptions of the latent-trait model and two, beginning to reconcile two
models, one with the other. I believe that Keats's paper has contributed
significantly to these purposes and to the theme of this seminar.

The Use of Latent Trait Models in the
Measurement of Cognitive Abilities
and Skills

Bruce Choppin

MEASUREMENT SYSTEMS 
What motivates most of the work to be described in this paper is a wish
to develop a sounder basis for the measurement of educational achieve- 
ment. I will not dwell much on the short-comings of traditional ap- 
proaches based on the true score concept except where it is necessary to 
point out that it does not lead to a system of measurement with the sort 
of properties that we want. 

In the argument, I plan to use the measurement of temperature as an
analogy. Temperature is a familiar concept and ideas about some objects 
being hotter or colder than others must reach back very far in human 
history. But temperature is an invisible commodity and measurements 
may be made only indirectly. Turning 'hot' and 'cold' into number values
on a scale did not come easily, and even when two temperatures (say the 
freezing and boiling points of water) are given arbitrary numerical values 
there is no very obvious procedure for locating intermediate tempera- 
tures on the scale (Middleton, 1966). However, the problems of measur- 
ing temperature have been largely solved in the last 150 years and the way 
in which the measurement system developed contains some useful 
lessons. 

It could be argued that human achievement is a very different type of
concept. The outcomes of it may be extremely visible, and there would
seem to be no need to turn to indirect methods of measurement. For
some areas of achievement this is clearly true. If we want to know how
fast someone can run, we can time them over a fixed distance with a
stop-watch. If we want to know how high they can jump, we set up hurdles
and measure them in centimetres or inches. Mental abilities in general,
and academic achievement in particular, do not lend themselves to a
direct approach. If we want to know how good somebody is at
mathematics, we cannot expect a tape measure or stop-watch to give us
much help. The best we can do is to take a sample of tasks from the
realm of mathematics and, by observing performance on those tasks,
infer something about a hypothetical level of performance on a more
general ability. It is in this sense that the measurement of academic
achievement has to be indirect, and the trait itself treated as latent.

What then are the properties we seek in a system of measurement?
First, as a matter of convenience, we require that the instruments with
which the measurements are to be made shall be usable over a range of
values of the variable being measured. A thermometer that works only at
one particular temperature has very limited value. An unmarked stick
exactly six feet long may be quite useful for dividing people into two
groups, one whose members are all shorter than six feet and the other
whose members are all taller, but its value as a measuring instrument
will be extremely limited in comparison with a properly calibrated
ruler.

Further, one would require that the instrument is not unduly sensitive 
to factors irrelevant to what is being measured. Neither thermometers 
nor rulers should react noticeably to changes in humidity or barometric 
pressure. 

More fundamental perhaps is the requirement that instruments should
be to some extent interchangeable. It should not matter which of several
available thermometers is used to measure the temperature of a room or
the temperature of a cup of coffee. The results obtained should not
depend upon which thermometer is chosen, and this has implications for
the calibrations employed. In itself calibration is not a difficult
task. Any set of marks on a ruler can be treated as calibrations of
length and any set of marks on a thermometer as calibrations of
temperature. The raw score achieved on a test can reasonably be regarded
as a calibration of performance. The problem arises when consistency
among the calibrations is required so that instruments themselves may be
used interchangeably.

Cross-calibration procedures have something in common. To calibrate
two thermometers one against the other, you might use both to measure
the temperature in several situations (say freezing and boiling water
and a number of points in between) and observe carefully the readings on
each thermometer. To cross-calibrate two tests, a straightforward
procedure would be to give both tests to a number of people and to
observe the raw score of each person on each test. In this way a table
could be developed to show how the raw score on one test was related to
the raw score on the other. This, in a limited sense, is a basis for
interchangeability since if a person's score on one of the tests is
known it would always be possible to predict his score on the other.

Unfortunately this does not work too well in practice. Firstly, the
errors of measurement usually present with the test scores are of such a
magnitude that in a real-life situation a single raw score on test A is
likely to correspond to a whole range of raw scores on test B. (The same
sort of thing happens with thermometers but the measurement error there
is so small that it is usually ignored.) Secondly, all pairs of
instruments have to be brought together for the cross-calibration, and
this rapidly becomes impracticable as the number of instruments for
measuring a particular variable is increased. If I develop a new test of
arithmetic in a world where 100 other tests of this topic already exist,
then in theory I need to carry out 100 cross-calibration experiments in
order to make my new test fully a part of the measurement system.

To solve this problem for temperature, constructors of thermometers
make use of the apparently regular, though different, expansion
properties of solids, liquids, and gases, as temperature is increased.
Most thermometers make use of these expansions so that an indirect
measure of temperature is obtained by making a direct measurement of
length. Equal changes in length are said to correspond to equal changes
in temperature, and this makes the construction of a variety of types of
thermometer relatively straightforward. Calibration is carried out
against a standard thermometer at only two points on the scale, the rest
of which is marked off in equal intervals of length. With this system,
one can use a number of different thermometers with confidence that a
reading of 47 degrees on one of them means more or less the same thing
as a reading of 47 degrees on any of the others.

The consistency achieved by real-life thermometers is frequently
exaggerated. Mercury-in-glass and platinum-resistance thermometers
which agree at 100°C will differ by several degrees at 300°C (Nelkon and
Parker, 1968). Neither is necessarily true or false; they represent two
facets of an inherently inconsistent system. Though less dramatic,
similar inconsistencies occur among liquid-in-glass thermometers. Many
different liquids all with different properties were tried during the
first hundred years of thermometry and the general agreement in the 18th
century to standardize on mercury as the liquid seems to have been
arbitrary (Eysenck, 1980).

With mental tests, a major effort to get around the calibration problem
has come to be known as norm referencing. Here a hypothetical scale of
performance is defined by the distribution of ability within a
particular population. If a particular test is administered to the whole
population (or a representative sub-sample of it), then it is possible
to define a transformation of raw test scores into, for example,
percentiles. Thus a given score on test A may be held to be equivalent
to a score of 36 on test B, if both translate to the same percentile
value for the same population. As past experience has shown that a
normal distribution of ability is a reasonable hypothesis to hold for
most populations, the procedure to establish percentile norms for a new
test need not be arduous. (Age norms take these procedures one step
further by using the average performance of each of a whole set of
different sub-populations.)
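
The percentile transformation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the two norming samples are invented, the function names `percentile_rank` and `equivalent_score` are hypothetical, and ties are counted as half, which is one common convention rather than anything specified in this paper.

```python
from bisect import bisect_left, bisect_right

def percentile_rank(score, norm_scores):
    """Percentage of the norming sample below `score`, counting ties as half."""
    ordered = sorted(norm_scores)
    below = bisect_left(ordered, score)
    ties = bisect_right(ordered, score) - below
    return 100.0 * (below + 0.5 * ties) / len(ordered)

def equivalent_score(score_a, norms_a, norms_b):
    """Score on test B whose percentile rank is closest to that of score_a on test A."""
    target = percentile_rank(score_a, norms_a)
    candidates = sorted(set(norms_b))
    return min(candidates, key=lambda s: abs(percentile_rank(s, norms_b) - target))

# Invented raw scores for the same (hypothetical) reference population.
test_a_norms = [12, 15, 15, 18, 20, 22, 25, 27, 30, 33]
test_b_norms = [30, 32, 34, 36, 38, 40, 42, 44, 46, 48]

rank_a = percentile_rank(20, test_a_norms)
match_b = equivalent_score(20, test_a_norms, test_b_norms)
```

The equivalence holds only for the population that supplied the norms, which is the limitation of norm referencing.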

It is instructive to consider norm referencing for temperature. Suppose
thermometers were all calibrated in terms of the percentage of days that
were cooler than a particular temperature. They would have their uses,
but clearly also their limitations. They would not be much use for
measuring the temperature of cups of coffee or human body
temperatures. In the context of weather, the calibrations would be
meaningful only in the restricted context of the climate where the
calibration was carried out, and even then they would not be much use
for night-time temperatures. Further one might well say, 'Today is a
cool day considering the whole year, but is it cool for the end of
May?'. Normed calibrations would not contain much information about
that.

These limitations are just those that restrict the usefulness of the
norm-referenced standardized test. The calibrations are only strictly
relevant in the context of the reference population and, in real-life
situations, it is almost always true that the population of interest
will not be the one on which the calibration was carried out. Human
populations are not particularly easy to define and, in particular, the
characteristics of student populations usually change quite rapidly, so
that a standardization carried out one year may already be noticeably
inaccurate twelve months later. Further the idea of a single population
is often unhelpful. Individual children need to be considered in terms
of their own characteristics, and not merely as representatives of a
national population. Sex, ethnic origin, and educational background may
all be crucial to the interpretation of a particular test performance.

If norm referencing is not the answer, then perhaps we should seek
some theoretical basis for test interpretation analogous to that which
turned the problem of the measurement of temperature into essentially a
problem of the measurement of length. Just as the relationship of the
expansion of materials with temperature can be expressed (approximately)
in mathematical form as an equation, so we seek a formal mathematical
representation of the way that performance level on an achievement trait
translates itself into observable performance on a mental test. It
should be clear by now that the model implied by adding up the number of
correct responses and using the resulting score as a measure of
achievement will not do. In general the same person would get different
scores on different tests even though his achievement level remains
constant, and so we must somehow build information about the content of
a particular test into our model.

A single test item is analogous to the six-foot long uncalibrated
measuring rod we mentioned earlier. It can divide people into two
groups: those who can answer it correctly and those who cannot. A test is
like a bundle of rods of different lengths. A height-measuring system
based on them must take account of which rods are in the bundle.

In the last 20 years or so, a number of alternative models of test 
behaviour have been advanced. In this paper I shall concentrate on one 
of them which seems to have by far the widest range of application. This 
is the Rasch model which relates the probability of a person responding 
correctly to a particular item to a function of just two parameters: the 
ability of the individual and the difficulty of the item (Rasch, 1960). 

It is usually written:

$$\Pr[a_{vi} = 1] = \frac{W^{(\alpha_v - \delta_i)}}{1 + W^{(\alpha_v - \delta_i)}} \tag{1}$$

where $a_{vi} = 1$ represents the event that person $v$ responds
correctly to item $i$; $\alpha_v$ is a parameter measuring the ability
of person $v$ and $\delta_i$ is a parameter measuring the difficulty of
item $i$. $W$ is a constant which determines the size of the units of
measurement. For measures in logits, $W$ is set at $e$. For measures in
brytes or wits, $W$ is set at $3^{0.2}$ which is about 1.2457.
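
As a minimal sketch, equation (1) can be evaluated as follows. The function name `rasch_probability` and the example values are illustrative assumptions; only the formula and the two choices of W come from the text.

```python
import math

def rasch_probability(ability, difficulty, base=math.e):
    """Probability of a correct response under equation (1); `base` is W."""
    odds = base ** (ability - difficulty)
    return odds / (1.0 + odds)

WIT_BASE = 3 ** 0.2  # W for brytes or wits, about 1.2457

# Ability equal to difficulty: success and failure are equally likely.
p_equal = rasch_probability(0.0, 0.0)

# One logit above the item's difficulty.
p_logit = rasch_probability(1.0, 0.0)

# Five wits above the item's difficulty: odds are (3 ** 0.2) ** 5 = 3 to 1.
p_wits = rasch_probability(5.0, 0.0, base=WIT_BASE)
```

With the wit base, a difference of five units between ability and difficulty corresponds to odds of three to one, which is why the constant is a fifth root of three.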

Certain assumptions are built into this model. The first is that the
probability of a particular response being correct does not depend on
which other individuals are attempting the same item, or on the pattern
of responses these individuals might give. More importantly perhaps, it
assumes that the probability of a correct response to a particular item
does not depend on which other items make up the test, in which order
they appear, or what responses were given to the items that preceded the
one under examination. This assumption is known as 'local
independence'. Secondly the model assumes that the individual's response
is conditioned by his ability to answer questions in this area, but not
by his motivation, his tendency to guess, his degree of hunger, or
indeed any other personal attribute. Thirdly, the model assumes that one
and only one item parameter (difficulty) affects the outcome and that
other item characteristics (such as reliability or discrimination) are
not relevant.

In both Britain and the United States, the educational literature still
occasionally produces an outraged statement by a respected traditionalist
that he has looked at what underlies all the fuss over the Rasch model,
and he has found that the model simply is not true. If this can be taken
to mean that the Rasch model does not exactly represent the behaviour of
real people in actual testing situations, then let me hasten to agree.
The Rasch model is a gross simplification, deliberately designed to
provide an approximate representation of reality, not an exact one. That
indeed is the virtue of scientific models. Is Charles' law about the
expansion of gases true? No, of course not. Neither is van der Waals'
equation of state a true account of the behaviour of gases under changes
in temperature. It is more accurate than Charles' law, but still an
approximation. Newton's laws of motion are themselves no more than an
approximation. They work very well most of the time, but are woefully
inadequate in some circumstances.

More telling is the criticism raised repeatedly by Goldstein and others
in the United Kingdom (e.g. Goldstein, 1979) that the Rasch model is
unsound because its basic assumptions are untenable. That they are
untenable is not really open to dispute, but I would argue that this
again is not a sufficient reason for dismissing the model. After all, the
use of flat maps to represent portions of the earth's surface involves
assumptions about the preservation of relative areas and distances that
we know to be untrue. Exactly what does a scale of 1:1 000 000 or 10
miles to the inch imply? Yet two-dimensional maps are almost everywhere
regarded as being useful aids to personal navigation. Our experience to
date with the Rasch model suggests that it is quite robust with regard to
violation of its assumptions. Even when items have a built-in dependence
upon one another and when parameters such as discrimination vary widely,
the results obtained when the Rasch model is used to measure people show
at most only very minor inconsistencies.

There are, I would submit, three separate reasons for adopting the
Rasch model as the basic scaling technique for measures of achievement.
They are:

1 It is mathematically simple and convenient to use. Methods of
estimating the parameters are relatively straightforward and they do not
require vast amounts of data.

2 The model is in fact a direct extension of current testing practice
which adds up the number of correct responses and uses this as a
measure. In fact a number of authors have shown that under normal
circumstances the raw score on a test is a sufficient statistic for the
ability of the person achieving it. That is, all the information about
the person's ability contained in the set of responses he gives is
concentrated in the raw score. Similarly all the information about the
relative difficulty of items is contained in the set of facility indices
(i.e. the proportion of correct responses item by item). Thus, if we
have a complete data matrix resulting from each of a particular group of
people attempting all the items in a particular test, where one is
reported in the matrix for a correct response and zero for an incorrect
response, then the marginal sums of this matrix contain all the
information necessary for calibrating these items and measuring the
people. The Rasch model provides support for the use of raw scores for a
variety of measurement purposes such as the ranking of students. There
is a one-to-one monotonic relationship between the raw score and the
underlying latent trait scale.

3 The Rasch model does appear to predict the behaviour of real test
items and real people with considerable accuracy (given the enormity of
the simplifying assumptions).
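
The sufficiency claim in the second reason can be checked directly from equation (1). The following is a short derivation sketch, assuming local independence across the $k$ items. The symbols $x_i$ (the observed response, 1 or 0, of person $v$ to item $i$) and $r = \sum_i x_i$ (the raw score) are notation introduced here for illustration.

$$\Pr[a_{v1} = x_1, \ldots, a_{vk} = x_k] = \prod_{i=1}^{k} \frac{W^{x_i(\alpha_v - \delta_i)}}{1 + W^{(\alpha_v - \delta_i)}} = \frac{W^{r\alpha_v}\, W^{-\sum_i x_i \delta_i}}{\prod_{i=1}^{k} \left(1 + W^{(\alpha_v - \delta_i)}\right)}$$

The ability $\alpha_v$ enters only through the raw score $r$ and through a denominator that does not depend on the responses. Two response patterns with the same raw score therefore have a likelihood ratio that is free of $\alpha_v$, so which particular items were answered correctly carries no further information about ability.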

A straightforward illustration of this is given below. When one item is
more difficult than another, this manifests itself by a tendency for
people who succeed on it also to succeed on the other item. Information
about the relative difficulty of the two items is contained in the
respective success rates for a given group of people.

The Rasch model leads to the somewhat surprising (but easy to
remember) result that, for any two items ($i$, $j$) measuring the same
ability, the ratio of the number of people who respond correctly to $i$
and incorrectly to $j$ (say $b_{ij}$), to the number who do the opposite
(say $b_{ji}$) should be constant, a measure of the relative difficulty
of items $i$ and $j$, no matter what the ability level or distribution
of the people (Choppin, 1978).

In the notation of equation (1)

$$W^{(\delta_j - \delta_i)} = \frac{\Pr[a_{vi} = 1,\ a_{vj} = 0]}{\Pr[a_{vi} = 0,\ a_{vj} = 1]} \tag{2}$$

and $(\delta_i - \delta_j)$ is estimated by

$$\frac{\log b_{ji} - \log b_{ij}}{\log W}$$
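
The sample-free character of this ratio can be illustrated with simulated data. The sketch below is an assumption-laden demonstration, not a reanalysis of any real survey: the item difficulties, the three ability distributions, the sample size, and the random seed are all invented, and responses are generated to follow equation (1) exactly.

```python
import math
import random

def rasch_probability(ability, difficulty, base=math.e):
    """Probability of a correct response under equation (1)."""
    odds = base ** (ability - difficulty)
    return odds / (1.0 + odds)

def pairwise_counts(abilities, delta_i, delta_j, rng):
    """Count b_ij (i right, j wrong) and b_ji (j right, i wrong)."""
    b_ij = b_ji = 0
    for ability in abilities:
        right_i = rng.random() < rasch_probability(ability, delta_i)
        right_j = rng.random() < rasch_probability(ability, delta_j)
        if right_i and not right_j:
            b_ij += 1
        elif right_j and not right_i:
            b_ji += 1
    return b_ij, b_ji

rng = random.Random(1980)
delta_i, delta_j = 0.5, -0.5      # item i is one logit harder than item j
log_w = math.log(math.e)          # W = e, so measures are in logits

estimates = {}
for label, mean_ability in [("low", -1.5), ("middle", 0.0), ("high", 1.5)]:
    abilities = [rng.gauss(mean_ability, 1.0) for _ in range(20000)]
    b_ij, b_ji = pairwise_counts(abilities, delta_i, delta_j, rng)
    # Estimate of (delta_i - delta_j) from the two discordant counts.
    estimates[label] = (math.log(b_ji) - math.log(b_ij)) / log_w
```

Each entry of `estimates` should lie close to the true difference of one logit, even though the three simulated groups differ by three logits in average ability.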

To illustrate this, I have looked at four items used in the 1971 IEA
Science survey (Comber and Keeves, 1973). The data I have are from six
separate samples of about 1000 pupils ranging from one of eighth grade
pupils in non-academic streams to one of twelfth grade academic pupils.
Traditional facility values for a single item vary widely from one sample
to another reflecting the varying abilities of the pupils. Two of the
four items are in chemistry and two in biology. For each pair, the values
of $b_{ij}$ and $b_{ji}$ were counted in each sample. The results are
plotted in Figure 1. The relative difficulty of the two items, according
to the Rasch model, is given by the slope of the line joining the origin
to the sample point. You will note the consistency from one sample to
another despite the extreme variation in ability.

These data are not 'cooked'. The items were drawn at random from the
set that were administered across the wide age group. Discrimination
indices range from 0.14 to 0.36. Both biology items were said to be
measuring 'understanding' but, of the chemistry pair, one was classified
as 'knowledge of facts' and the other as 'higher mental processes'.

So, as well as illustrating the sample-free aspect of Rasch relative
difficulty, these data give some insight into the robustness of the
model. The structure holds up well even when the departures from the
underlying assumption of homogeneity and uniform discrimination are
quite substantial.

Figure 1 Results for Four Science Items on Six Widely Differing
Samples Each of about 1000 Students (For a pair of items
within one sample, $b_{ij}$ is defined as the number of individuals
who responded correctly to item $i$ and incorrectly to item $j$.)

MORE COMPLEX MODELS?

In the United States the Rasch model has often been referred to as the
'one-parameter' model because it uses only one parameter to describe a
test item (the difficulty parameter). A 'two-parameter' model has been
proposed and investigated from a mainly theoretical point of view. Like
the Rasch model it relies on a single parameter to describe the person
taking the test (ability) but, in addition to difficulty, it introduces a
parameter to represent the item's power of discrimination. Largely in
response to theoretical objections resulting from the use of the
multiple-choice format, a 'three-parameter' model has been suggested.
This still retains a single parameter for the person involved, but adds
another item parameter representing the 'guessability' of the item (i.e.
the limiting probability that a person with absolutely no relevant
ability would still respond correctly to the item). Though there is a
substantial amount of published discussion of this model, it has been
little used in practice.
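
The three models can be placed side by side in one function. This sketch assumes the conventional logistic parametrization of the two- and three-parameter models, which the text does not spell out; the function name `item_response` and its argument names are illustrative.

```python
import math

def item_response(ability, difficulty, discrimination=1.0, guessing=0.0):
    """Probability of a correct response.

    With the defaults this reduces to the Rasch (one-parameter) model in
    logits. A `discrimination` other than 1 gives the two-parameter model,
    and a non-zero `guessing` gives the three-parameter model.
    """
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
    return guessing + (1.0 - guessing) * logistic

p_rasch = item_response(0.0, 0.0)
p_two = item_response(1.0, 0.0, discrimination=2.0)
p_three = item_response(0.0, 0.0, guessing=0.25)

# A person of very low ability still succeeds at about the guessing rate.
p_floor = item_response(-50.0, 0.0, guessing=0.25)
```

Only the one-parameter case keeps the raw score as a sufficient statistic for ability; once discrimination varies from item to item, the items must be weighted differently in scoring.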

In passing it should be noted that there has been discussion about the
advantages of using a probability function based on the normal curve
(proposed by Lord, 1952) rather than the logistic (exponential) functions
adopted by Rasch and Birnbaum. In quantitative terms it appears that
this would make very little difference to the results obtained and the
logistic form of model is now generally preferred because of its relative
mathematical simplicity.
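
How little difference the choice makes can be checked numerically. The sketch below compares the normal ogive with a logistic curve rescaled by the constant 1.702; that scaling constant is a conventional choice assumed here and does not appear in this paper.

```python
import math

def normal_ogive(z):
    """Cumulative standard normal distribution."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def logistic(z, scale=1.702):
    """Logistic curve, rescaled to approximate the normal ogive."""
    return 1.0 / (1.0 + math.exp(-scale * z))

# Largest gap between the two curves over a grid from -4 to +4.
grid = [x / 100.0 for x in range(-400, 401)]
max_gap = max(abs(normal_ogive(z) - logistic(z)) for z in grid)
```

On this grid the two curves never differ by as much as 0.01 in probability, which is small compared with the sampling error in most test data.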

The reason for interest in these more complex models is clear. The
simplifications made by the Rasch model are rather extreme, and a more
complex model could be expected to provide a better fit to real data.

Against this one must set the disadvantages of losing the simple
one-to-one relationship of latent trait measure with raw score. The more
complex models require very lengthy computation in order to score the
test. Even to rank candidates on a very short test, these models will
almost inevitably require the use of a computer.

Secondly, while quite usable estimates of the Rasch model parameters
can be obtained from as few as 30 candidates attempting a test, the more
complex models seem to require samples running into the thousands in
order to obtain a similar degree of reliability in parameter estimation.
Observations come as single bits of evidence, and it seems to be
difficult to squeeze more than one item parameter out of a single bit.
It is for this reason, I suspect, that Wright found that, in practice,
the more complex models fitted real data less well than did the Rasch
model.

The orthodox view among Rasch scalers is that it is better to avoid the
problems which prompt the introduction of extra item parameters. A
good test from the Rasch point of view (and hence also from the point of
view of those who use traditional test statistics) is one which avoids
substantial amounts of guessing and items whose discrimination
parameters vary widely. Where the Rasch model is used as an aid to test
construction, it is usually employed in this mode.

PARAMETER ESTIMATION 
It is now nearly 20 years since Wright and I began to play around with
different computational algorithms for solving the Rasch model. A good
many other people have come in on the act since then, and it would be
fair to say that by now this particular topic is fairly thoroughly
explored and documented.

That there is a problem requiring a solution at all results from the
stochastic or probabilistic form of the Rasch model itself. It gives only
the probability of a correct response to an item whereas the actual
observation is of success or failure. There is unfortunately no way to
observe probabilities directly.†

In general, however, if we have N people responding to the k items in a
test we have a total of Nk bits of information to estimate the N + k
parameters (N abilities and k difficulties). The problem is to find
values of the parameters that best explain the set of observations, and
then to check that this explanation is good enough to justify the use of
the Rasch model in these particular circumstances. Many ways of fixing
the values of the parameters have been suggested; some precise but
computationally rather long-winded (Andersen, 1973); some rough and ready
but easy to calculate (Wright and Stone, 1979). In general they can be
grouped into two separate categories: 'least squares' and 'maximum
likelihood'.

The least squares approach is based on the idea of minimizing the 
discrepancies and deviations from the model. It is the approach sug- 
gested by Rasch in his book, and was the first method to be investigated 
in any detail. 

The maximum likelihood approach begins by specifying the probability
of a particular set of observations, given particular values for the
parameters. The procedure then calculates the values of the parameters
that make this probability function a maximum. In practice the log
likelihood function is maximized as this is computationally rather
easier, and leads directly to estimates for the standard errors of the
parameters.
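
A minimal sketch of the maximum likelihood idea, for the simplest case only: estimating one person's ability in logits when the item difficulties are treated as already known. This is a generic Newton-Raphson iteration under that assumption, not a reproduction of any of the published algorithms cited here, and the function name `estimate_ability` is hypothetical.

```python
import math

def estimate_ability(responses, difficulties, tolerance=1e-8, max_iter=50):
    """Maximum likelihood ability (logits) and its standard error.

    `responses` holds 1 (right) or 0 (wrong) for each item whose
    difficulty appears in `difficulties`.
    """
    raw_score = sum(responses)
    if raw_score == 0 or raw_score == len(responses):
        raise ValueError("zero and perfect scores have no finite estimate")

    def probabilities(ability):
        return [1.0 / (1.0 + math.exp(-(ability - d))) for d in difficulties]

    ability = 0.0
    for _ in range(max_iter):
        probs = probabilities(ability)
        # First derivative of the log likelihood: observed minus expected score.
        gradient = raw_score - sum(probs)
        # Negative second derivative (the information).
        information = sum(p * (1.0 - p) for p in probs)
        step = gradient / information
        ability += step
        if abs(step) < tolerance:
            break

    information = sum(p * (1.0 - p) for p in probabilities(ability))
    return ability, 1.0 / math.sqrt(information)

difficulties = [-1.0, 0.0, 1.0]
estimate_a, error_a = estimate_ability([1, 1, 0], difficulties)
estimate_b, error_b = estimate_ability([0, 1, 1], difficulties)
```

The two response patterns share a raw score of two and so receive the same ability estimate, and the standard error falls out of the same second derivative that drives the iteration.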

Some but by no means all maximum likelihood methods produce a
systematic bias in the parameter estimates obtained (an approximate
correction factor has been proposed to take care of this). Some but by no
means all maximum likelihood methods are computationally lengthy and
hence rather expensive. My own experience suggests that, although the
quickest maximum likelihood method (producing unbiased results) cannot
compete for speed with the shortest of the least squares approaches,
it is somewhat more stable with badly conditioned data. Where both
methods are applicable to the same set of data, it is comforting that the
results usually agree to within a tenth of a standard error, and I have
never seen a case where real (as opposed to artificial) data gave rise to
least squares and maximum likelihood estimates that differed by more
than half a standard error.

† In this case (as in many others) it would be quite impracticable to
face an individual with the same test item on a large number of
occasions in order to estimate the probability by way of the relative
frequency of success.

EXTENSIONS TO THE BASIC RASCH MODEL 

Most of the published accounts of the use of the Rasch model refer to the
standard situation where each of a group of people attempts all the items
in a test, and where each item response is scored either right or wrong.
The last few years, however, have seen a great deal of work in developing
extensions to the basic model to cope with more complex testing
situations. I will consider three such extensions.

Incomplete Observation Matrices

If every person in a group attempts every item in a test, the data can be
arranged as an N by k rectangular matrix of ones and zeros, where one
represents a correct response and zero an incorrect response, as in
Figure 2.

                         Items
    Persons      1    2    3    4    5    6    Marginal sum
       A         1    1    0    1    0    0         3
       B         1    0    1    1    1    0         4
       C         1    0    1    0    0    0         2
       D         1    1    0    1    0    0         3
       E         0    1    0    0    0    1         2
       F         1    1    1    0    0    0         3
       G         1    0    1    1    1    0         4
       H         1    1    0    1    1    0         4
       I         1    1    0    0    0    0         2
       J         1    1    1    1    0    0         4
       K         0    0    1    0    0    0         1
       L         1    1    0    0    0    0         2
    Marginal    10    8    6    6    3    1
    sum

Figure 2 Hypothetical Data Matrix for Results of 12 Persons on a Six
Item Test (N = 12, k = 6)

In this case the (N + k) marginal values (i.e. the row and column sums)
contain enough information to estimate all the parameters. However,
many real-life situations occur in which this matrix is incomplete.
Although it will then have less than Nk bits of information, it will
still usually have more than enough to develop estimates for the N + k
parameters. With missing data, however, the parameters can no longer be
estimated from the margins alone.
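
For a complete matrix, the marginal values are simple to compute. The sketch below enters the hypothetical data of Figure 2 and forms the row sums (raw scores) and column sums (item totals); the variable names are illustrative.

```python
# Responses of the 12 persons (A to L) to the six items of Figure 2.
responses = {
    "A": [1, 1, 0, 1, 0, 0],
    "B": [1, 0, 1, 1, 1, 0],
    "C": [1, 0, 1, 0, 0, 0],
    "D": [1, 1, 0, 1, 0, 0],
    "E": [0, 1, 0, 0, 0, 1],
    "F": [1, 1, 1, 0, 0, 0],
    "G": [1, 0, 1, 1, 1, 0],
    "H": [1, 1, 0, 1, 1, 0],
    "I": [1, 1, 0, 0, 0, 0],
    "J": [1, 1, 1, 1, 0, 0],
    "K": [0, 0, 1, 0, 0, 0],
    "L": [1, 1, 0, 0, 0, 0],
}

# Row sums: the raw score of each person.
person_scores = {person: sum(row) for person, row in responses.items()}

# Column sums: the number of correct responses to each item.
item_totals = [sum(column) for column in zip(*responses.values())]

# Facility index of each item: proportion of correct responses.
facilities = [total / len(responses) for total in item_totals]
```

These 18 numbers (12 raw scores and 6 item totals) are the N + k marginal values referred to in the text.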

There are several different situations in which omissions not resulting
from a candidate's inability to answer a question can occur. One is
through some irregularity in the test administration. Some candidates
may be given test booklets containing printing errors; one may be taken
ill half-way through an examination.

Another such situation is a test or examination in which the candidate
is allowed a choice of questions. This procedure is fairly standard in
British public examinations where a typical rubric might run 'Answer
questions 1, 2, and any five from numbers 3-15'. It is widely believed
that such a procedure is fair to candidates who may (through no fault of
their own) have missed being taught some parts of the curriculum. Be that
as it may, it can easily result in a situation where, although an
examination consists of N items, no student vector contains more than n
responses, and of course different students choose different questions.

A third situation in which the data matrix is not complete occurs when
different tests are quite deliberately given to different students.
Sometimes this method is adopted to prevent students from copying the
answers of their neighbours, and hence to increase the overall security
of the examination, but it is also a quite legitimate way of increasing
the range of items on which achievement data are gathered without unduly
lengthening the time devoted to testing. One example of the latter kind
occurred in the IEA 1971 Survey of Science Achievement which I
mentioned earlier. For students in the pre-university year six different
forms of the test were prepared. Each consisted of a basic sub-test of
60 items which appeared in all forms, and six more advanced items drawn
from an additional pool of 36. This ensured that, while no student was
asked to respond to more than 66 items, achievement data on a total of
96 items were obtained. Another pertinent example occurs in the work the
NFER is doing as part of a national monitoring of standards for the
government's Assessment of Performance Unit (APU). For the assessment of
primary school mathematics, 26 different forms of test were used, each
one containing only one-thirteenth of the total pool of items. (The test
was so arranged that each item appeared in two separate forms.) When
observations from these types of testing are arranged in a
persons-by-items matrix (see Figure 2), it is clear that large parts of
the matrix will be empty. Yet it would still be desirable to be able to
calibrate all the items one against the other and to measure the
achievement of all the people.

Another, and rather more novel, type of 'missing data' occurs as a
result of editing an observation matrix, often as a result of a Rasch
scaling analysis. For example, after estimating item difficulties and the
abilities of candidates, it is possible to use the model to examine the
probability or improbability of each separate item response. This method
has been used to identify lucky guesses (a student correctly answers a
question that appears from the rest of the data to be far too difficult
for him). Further, one may discover that a group of testees respond in
apparently eccentric fashion to perhaps a sub-group of items, suggesting
that perhaps those particular items are not appropriate for them. It is
possible, in this way, to identify instances where the item response
appears not to be a good indication of the candidate's overall level of
achievement and these instances can be edited out before a second
analysis of the persons-by-items matrix. This matrix will now have some
holes.
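
The editing step can be sketched as follows. The cut-off of 0.05, the function name `flag_unexpected`, and the tiny data set are assumptions made for illustration; the text does not say what level of improbability was used in practice. Missing responses are represented by `None`, and measures are taken to be in logits.

```python
import math

def flag_unexpected(responses, abilities, difficulties, threshold=0.05):
    """List (person, item, response) triples the model finds improbable.

    responses[v][i] is 1, 0, or None when person v did not attempt item i.
    """
    flagged = []
    for v, row in enumerate(responses):
        for i, observed in enumerate(row):
            if observed is None:
                continue
            p_correct = 1.0 / (1.0 + math.exp(-(abilities[v] - difficulties[i])))
            p_observed = p_correct if observed == 1 else 1.0 - p_correct
            if p_observed < threshold:
                flagged.append((v, i, observed))
    return flagged

# Invented example: person 0 is weak, person 1 is strong; item 1 is hard.
abilities = [-2.0, 2.0]
difficulties = [-1.0, 2.0]
responses = [[1, 1], [1, 0]]

flagged = flag_unexpected(responses, abilities, difficulties)

# Edit the flagged responses out, leaving holes in the matrix.
edited = [list(row) for row in responses]
for v, i, _ in flagged:
    edited[v][i] = None
```

Here the weak person's success on the hard item has a model probability of under 0.02 and is treated as a probable lucky guess, so the edited matrix has a hole at that position.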

How is the incomplete matrix analysed? The answer comes from the
result mentioned earlier for pairs of items. The ratio of the probability
of getting item $i$ correct and $j$ incorrect to the probability of
getting $i$ incorrect and $j$ correct is a simple function of the
relative difficulties of items $i$ and $j$. When faced with an
incomplete observation matrix, we decompose the test into all possible
sub-tests of length two. Any individual who responds to more than one of
the original k items contributes to the estimation of the relative
difficulties of at least some of the possible item pairs. Once all the
items have been calibrated, it is relatively straightforward to look at
the set of responses a particular person gave, and derive from them an
estimate of that individual's ability (Choppin, 1978). Both maximum
likelihood and least squares methods work in this analysis of incomplete
observation matrices and, although over the years my preferred method
has usually been that of maximum likelihood, I am now coming to the
conclusion that in routine testing situations the non-iterative least
squares algorithm is going to prove the more reliable.
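
A simplified sketch of a non-iterative pairwise calibration is given below. It is an illustration of the idea rather than the published algorithm: it assumes every pair of items has been compared in both directions at least once, centres the difficulties so that they sum to zero, and uses a hypothetical function name and an invented data set in which each person attempted only two of the three items.

```python
import math

def pairwise_calibrate(responses, base=math.e):
    """Item difficulties (summing to zero) from pairwise comparisons.

    responses: one row per person, each entry 1, 0, or None (not attempted).
    """
    n_items = len(responses[0])
    # b[i][j]: number of people who got item i right and item j wrong.
    b = [[0] * n_items for _ in range(n_items)]
    for row in responses:
        for i in range(n_items):
            for j in range(n_items):
                if row[i] == 1 and row[j] == 0:
                    b[i][j] += 1

    difficulties = []
    for i in range(n_items):
        total = 0.0
        for j in range(n_items):
            if i == j:
                continue
            if b[i][j] == 0 or b[j][i] == 0:
                raise ValueError(f"items {i} and {j} cannot be compared")
            # Estimate of (difficulty of i) minus (difficulty of j).
            total += (math.log(b[j][i]) - math.log(b[i][j])) / math.log(base)
        # With difficulties summing to zero, the row mean (including the
        # zero self-comparison) is the least squares solution.
        difficulties.append(total / n_items)
    return difficulties

# Invented incomplete matrix: nobody attempted all three items.
responses = (
    [[1, 0, None]] + [[0, 1, None]] * 2
    + [[1, None, 0]] + [[0, None, 1]] * 4
    + [[None, 1, 0]] + [[None, 0, 1]] * 2
)
difficulties = pairwise_calibrate(responses)
```

In this example the pairwise counts are mutually consistent, so the three difficulties come out as log 2, zero, and minus log 2 in logits even though no person supplied a complete response vector.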

Partial Credit

Suppose that instead of being scored zero/one, a set of item responses
has each been scored on a scale from zero to five depending on the degree
of 'correctness'. Tests of this sort are not unknown in Great Britain,
though I do not know whether you have to deal with them in Australia.

Two different methods have been developed for analysing such data
with a Rasch model. The first, stemming from the work of Wright and
his colleagues in Chicago, is to treat each item in the example given above
as replaceable by five dummy items each of identical difficulty and scored
zero/one. The score actually achieved on one of the original items is thus
taken to be the raw score obtained by summing the scores on the sub-set
of dichotomous dummy items. Since the dummy items associated with
one real item are assumed all to have the same difficulty value, the
margins of the observation matrix may be collapsed somewhat before the
conventional parameter estimation method is applied.

An alternative approach, and the one that I have been developing, is to
convert the actual score awarded (on the 0-5 scale) into a fractional score 
on a continuum from zero to one. Thus in the example quoted above, a 
score of 4 would be converted to one of 0.8, indicating that the candidate 
has achieved four-fifths mastery of that particular question (and one- 
fifth lack of mastery). On a second item for which the score is three, I 
deduce a degree of mastery 0.6 and lack of mastery 0.4. 

A simple extension of the Rasch model to accommodate partial scores
between zero and one, in lieu of dichotomous scores, replaces the
probability function by an expected value. From this we can use the scores
achieved on two items by the same individual to give an estimate of the
relative difficulty of the items.



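$$E\left(\frac{x_i}{5}\right) = \frac{W^{a - b_i}}{1 + W^{a - b_i}} \qquad (3)$$

Here x_i is the score (0 to 5) awarded on item i, a is the ability of the
candidate and b_i is the difficulty of item i.

This equation is analogous to equation (1) above. If we call the two items
in our example i and j, then the relative difficulty of the items is estimated
by

$$b_i - b_j \approx \log_W \frac{x_j\,(5 - x_i)}{x_i\,(5 - x_j)} \qquad (4)$$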

The items are then calibrated by looking at all possible item pairs as in
the missing data method outlined above. 
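
As a worked illustration of the fractional-score estimate, the sketch below applies it to the two scores used in the example (4 and 3 on the 0-5 scale). The function name is hypothetical, and the result is expressed in natural-log units (logits) rather than wits, since the conversion between the two is not restated in this section.

```python
import math

def relative_difficulty(score_i, score_j, max_score=5):
    """Estimate (difficulty of item i) - (difficulty of item j), in logits,
    from the scores one person obtained on the two items."""
    mastery_i = score_i / max_score   # e.g. 4/5 = 0.8
    mastery_j = score_j / max_score   # e.g. 3/5 = 0.6
    if not (0 < mastery_i < 1 and 0 < mastery_j < 1):
        # A zero or a full-marks score leads to log(0) or division by zero.
        raise ValueError("scores at either extreme give no finite estimate")
    return math.log((mastery_j * (1 - mastery_i)) /
                    (mastery_i * (1 - mastery_j)))

# Scores from the example in the text: 4 on the first item, 3 on the second.
print(relative_difficulty(4, 3))
```

The value is negative, so on this single comparison the first item appears the easier of the two. In practice the estimate would be pooled over all individuals who attempted both items.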

I have not much more to report about the use of non-dichotomous 
scoring systems at the moment. Both methods of analysis are in use. 
Sometimes they produce virtually identical results (e.g. where all the 
items in a test are scored on the same scale). On other occasions the 
results may be somewhat different, and it is up to the analyst to decide
which of the approaches is the more sensible in that particular case. The
first method I described weights questions according to the maximum
number of marks awarded for them. The second method weights all the 
questions equally. 

Early results that I have had suggest that non-dichotomous scoring can
give much more information about a candidate from a limited number of
item responses. It can thus substantially reduce the standard error of
measurement. On the other hand it is in general rather harder to meet the
Rasch model requirement of homogeneous discrimination levels when
non-dichotomous scoring is used.

Markers or Judges

This extension to the basic model is intended to cope with the situation
where several different judges provide ratings of the quality of some
performance. I think that the best way to present this will be to describe in
some detail an actual problem, and the progress made so far towards its
solution. It arose in connection with background work carried out for
the Assessment of Performance Unit that I mentioned earlier. In our
national monitoring of writing skills it was deemed appropriate to have
each writing sample graded by expert judges. Since it would clearly be
impracticable to have any one judge consider all the writing samples
obtained in a large national survey, we decided to explore the feasibility
of recruiting a pool of expert graders on whose judgment we could rely,
and whose variation on a severity-leniency scale could be minimized. To
accomplish this the following experiment was conducted.

Seven hundred and fifty students provided writing samples for
analysis. Eleven separate writing tasks had been defined and each student
was asked to respond to two of them, one from the set (task 1-task 10)
and also task 11. In the experiment four markers were used, although
only two marked each candidate's papers. Marker one was never paired
with marker three, and marker two never with marker four. Apart from
this all combinations of markers appeared with approximately equal
frequency.

Each marker was required to grade each task on four separate criteria. 
For the sake of brevity these criteria will be referred to as content,
grammar, style, and orthography. All grades were made on a 1-5 scale with 5
being the best work. 

Thus for each of the 750 candidates, we had 16 scores (two tasks x two
markers x four criteria). A Rasch scaling was carried out in order to
estimate the marker parameters and the task/criteria parameters. These
parameters were estimated to provide the best possible fit of the data to
the model

$$E\left(\frac{X - 1}{4}\right) = \frac{W^{a - b - m}}{1 + W^{a - b - m}} \qquad (5)$$

where a is the writing ability parameter for a student, b is the difficulty
level parameter for a criterion on a particular task, m is an adjustment
for marker severity and X is the grade awarded. The expression (X - 1)/4
converts the grade to a fractional score on the zero/one interval. This is
analogous to the procedure I described in the preceding section on partial
credit. For the present experiment it should be noted that the estimation
has no solution unless at least some interactions of task/criterion are
graded by more than one marker, and some individual pupils are graded
by the same marker on more than one task.

If these conditions are met,

$$b_i - b_j \ \text{is estimated by} \ \log_W \frac{\sum (X_j - 1)(5 - X_i)}{\sum (X_i - 1)(5 - X_j)}$$

where i, j are indices indicating particular task/criterion combinations
(henceforth called items) and the summations are taken over all pairs of
pupil and marker for which grades for both items i and j exist.
Similarly,

$$m_f - m_g \ \text{is estimated by} \ \log_W \frac{\sum (X_g - 1)(5 - X_f)}{\sum (X_f - 1)(5 - X_g)}$$

where X_f and X_g are the grades awarded by marker f and marker g, and
the summation is taken over all pupil responses to individual items for
which both marker f and marker g provided grades. These two sets of
equations were analysed using both least squares and maximum
likelihood methods. In no cases were the resulting calibrations different
from each other by more than 0.1 wits. All the results reported below are
drawn from the maximum likelihood analysis.
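
The marker comparison can be sketched in a few lines. This is an illustration only, under stated assumptions: the function names and the three pairs of grades are hypothetical, grades are on the 1-5 scale described above, and the answer is given in natural-log units (logits) rather than wits.

```python
import math

def grade_to_fraction(grade):
    """Convert a 1-5 grade to a fractional score on the zero/one interval."""
    return (grade - 1) / 4

def severity_difference(paired_grades):
    """paired_grades: (grade from marker f, grade from marker g) for each
    response that both markers graded. Returns an estimate of the severity
    of f minus the severity of g; positive means f is the more severe."""
    numerator = sum(grade_to_fraction(g) * (1 - grade_to_fraction(f))
                    for f, g in paired_grades)
    denominator = sum(grade_to_fraction(f) * (1 - grade_to_fraction(g))
                      for f, g in paired_grades)
    return math.log(numerator / denominator)

# Hypothetical grades: marker f tends to award one grade less than marker g.
pairs = [(3, 4), (2, 3), (4, 4)]
print(severity_difference(pairs))
```

Swapping the two markers changes only the sign of the estimate, so the comparison is symmetric, as the text requires.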

Averaging the results over criteria and tasks gave the overall difficulty
levels shown below:

(i) Tasks

Task       Mean difficulty (wits)
task 1     50.8
task 2     49.6
task 3     50.5
task 4     49.4
task 5     49.7
task 6     50.8
task 7     49.7
task 8     49.4
task 9     49.5
task 10    50.3



(ii) Criteria

Tasks 1-10
  criterion (a) content       adjustment -0.7 wits
  criterion (b) grammar       adjustment zero
  criterion (c) style         adjustment +0.3 wits
  criterion (d) orthography   adjustment +0.4 wits

Task 11
  criterion (a) content       51.5 wits
  criterion (b) grammar       51.4 wits
  criterion (c) style         48.7 wits
  criterion (d) orthography   49.5 wits

The interpretation of these results was that, for example, task 3 was on
average one wit harder to score well on than task 9 (50.5 - 49.5) and that,
for tasks 1-10, the content grading was about one wit more lenient than
the style grading ((-0.7) - (+0.3)).

Task 11 was reported separately because it was common to all
students, and the pattern of marking was substantially different.
Whereas 'content' was the easiest criterion on which to score for tasks
1-10, for task 11 it was the most difficult. Both content and grammar
were marked more severely for task 11 than for other tasks or other
criteria. Was this perhaps the result of overexposure of markers to
responses on this task? In any case, since everyone took task 11,
variations on it from the other tasks did not introduce any bias into the
estimation of student attainment.

The variations present in tasks 1-10 did have implications for 
measures of student attainment. If the outcome was taken to be a score 
derived from adding the eight separate grades provided by each of the 
two markers, then we can see that half of these grades depend to some 
extent on the choice of task. For a student of approximately average 
ability, one point on each grading scale is equal to about 4 wits. Hence 
the discrepancy of 1 wit between tasks 3 and 9 suggests a difference of 
about a quarter of a score point for each grade awarded. The sum of this 
would produce a difference of about 2 points in the total score; e.g. a
student who took tasks 3 and 11 and got a total of 47 would be expected to
score 49 on tasks 9 and 11. Although most discrepancies are smaller than
this, some are larger, suggesting that (Rasch) scaling of the raw results 
was highly desirable. 
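
The arithmetic behind the two-point figure can be laid out explicitly. The sketch below simply restates the numbers quoted in the text; the constant and function names are hypothetical.

```python
# Figures taken from the text: about 4 wits per grade point for a student of
# roughly average ability, and eight grades per task (two markers x four
# criteria). Only the grades on the chosen task (tasks 1-10) depend on the
# choice of task; the eight grades on task 11 do not.
WITS_PER_GRADE_POINT = 4
GRADES_ON_CHOSEN_TASK = 2 * 4

def total_score_shift(difficulty_gap_wits):
    """Expected change in total raw score produced by a difference in task
    difficulty, for a student of about average ability."""
    shift_per_grade = difficulty_gap_wits / WITS_PER_GRADE_POINT
    return shift_per_grade * GRADES_ON_CHOSEN_TASK

# Task 3 (50.5 wits) against task 9 (49.5 wits).
print(total_score_shift(50.5 - 49.5))
```

A gap of one wit thus moves each of the eight task-dependent grades by a quarter of a point, which accumulates to about two points on the total.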



The Results - Markers

Adjustments for variations of severity of markers are shown below:

                          Criterion
Marker   Content   Grammar   Style   Orthography   All criteria
1          0.4      -0.7      0.9        0.2            0.2
2          0         0.6      1.8       -0.3            0.5
3          0         0.3     -0.4       -0.2           -0.1
4         -0.4      -0.2     -2.3        0.3           -0.6
                   (SE: 0.5)                         (SE: 0.3)



These results showed that there were no significant marker effects except
on the criterion, style. 

On style the discrepancies were substantial: 

On average, marker 1 was 1 wit too severe, 
marker 2 was 2 wits too severe,
marker 3 was about right,
marker 4 was 2 wits too lenient. 

The structure of the data meant that it was not possible to test the con- 
sistency of these results over each task separately, but it was possible to 
compare grading standards on task 11 with the remainder. 

               All criteria                 Style
Marker   Tasks 1-10   Task 11      Tasks 1-10   Task 11
1           -0.3        0.6           -0.3        2.2
2            0.7        0.4            1.6        2.1
3           -0.2        0              0.2       -1.0
4           -0.1       -1.0           -1.4       -3.3
               (SE = 0.4)                (SE = 0.7)



From this it appeared that the main discrepancies between markers
occurred on task 11. The difference between markers 1 and 4 on task 11
'style' was 5.5 wits whereas on the other tasks it averaged only 1.1 wits.
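
The two differences quoted can be checked directly from the 'style' columns of the table. The short sketch below does this; the variable and function names are hypothetical.

```python
# Marker severities on 'style' in wits, copied from the table above
# (positive = severe, negative = lenient).
style_tasks_1_10 = {1: -0.3, 2: 1.6, 3: 0.2, 4: -1.4}
style_task_11 = {1: 2.2, 2: 2.1, 3: -1.0, 4: -3.3}

def severity_gap(table, marker_a, marker_b):
    """Difference in severity between two markers, rounded to 0.1 wit."""
    return round(table[marker_a] - table[marker_b], 1)

print(severity_gap(style_task_11, 1, 4))     # markers 1 and 4 on task 11
print(severity_gap(style_tasks_1_10, 1, 4))  # markers 1 and 4 on tasks 1-10
```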

Since all students responded to task 11 the effect of differential marker 
severity on total score was considerable. The effect of having markers 1 
and 2 rather than 3 and 4 was about 2 score points on task 11 and 1 score
point on the other task, or about 3 points in total. Overall the results
appear to confirm that the criterion 'style' was not being used in an
acceptable fashion by the markers. The rest of the calibrations appear
satisfactory. Marker calibration has been achieved, and on the basis of this set of
data it seemed reasonable to conclude that there were no systematic 
differences of severity between the standards adopted by the four 
markers in this experiment. 

MONITORING OVER TIME 

My colleagues at the NFER are now heavily engaged in various aspects of 
Rasch scaling and the chief reason is the British version of national 
assessment that I mentioned earlier, the APU. It is appropriate therefore 
to spell out the precise nature of the problems raised by our work for the 
APU, how we are proposing to solve them, and the sorts of results we 
hope to produce. 

The APU program includes the monitoring of achievement standards 
within our school system through the administration of tests to random 
samples of children at two or three different age levels. At the moment 
the cycle in each subject is an annual one (e.g. in mathematics, we ad- 
minister tests to a sample of 10-year-olds each May, and to a sample of 
15-year-olds each October) but it is possible that the testing frequency 
will be reduced in the future. The aim of the program is to provide a
detailed description of the attainment of children in our schools on a
wide variety of test material, to try to identify external factors associated 
with low patterns of performance over time. The aims are very similar to 
those which motivated the National Assessment for Educational Pro- 
gress (NAEP) in the United States, and in Britain too we have had a 
debate over the merits of a purely descriptive approach in contrast with 
attempts to identify causal links between academic performance and
background variables. However, for the moment, I shall concentrate on 
the issue of monitoring standards over a period of years, an issue that re- 
quires solutions to some interesting technical problems. 

The first problem appears on the surface to be straightforward; we 
cannot use the same items in the tests year after year. One reason for this 
is that we are required to report results on an annual basis, and it is 
regarded as essential that we publish at least some of the test items in 
each of these reports so that the general public will understand and be 
able to interpret the performance statistics that we give. But once an item
has been published in a report it becomes accessible to teachers who wish 
their pupils to perform well. As a result it may be expected that it will 
receive special treatment in a substantial proportion of classrooms and 
thus will no longer function as a good indicator of achievement across 
the curriculum. A more subtle difficulty inherent in the continued use of 
the same test items is that changes in curriculum do occur, albeit fairly 
slowly. Test items, that seemed entirely appropriate when the first 
mathematics tests were put together in 1978, may well seem much less ap- 
propriate for use in 1988. Our commitment is to test each year what is 
currently being taught in the schools. Our tests are supposed to remain 
up to date and the deliberate and repetitive re-use of the same items, 
although it would facilitate the comparison of test scores between years, 
would also guarantee that the tests steadily lost validity. 

At one stage our then Secretary of State for Education was reported to 
be extremely sceptical about the existence of any satisfactory alternative.
She felt that comparisons of performance from year to year would lack
credibility with the general public unless they were based on the same test
items. But, of course, there are ways to handle this problem. One fairly
neat procedure is that adopted for the Scholastic Aptitude Test in the
United States, wherein each test carries a small section which does not
contribute to the reported score, but which contains items that appeared
in the preceding year's tests. This provides a basis for equating standards
from one year to the next using regression techniques. Thus a reported
verbal aptitude score of 500 in 1980 should represent the same standard
of performance as a score of 500 on the 1979 test, and this score of 500
was itself equated to a score of 500 in 1978. The year-to-year linking,
always on the basis of current test material, is probably sound over a
small number of years during which the population of test candidates
does not change too dramatically. One suspects though that quite 
substantial errors may well accumulate over a period of 10 to 20 years, 
which was the sort of time scale at which the APU was aiming. 

We preferred to approach the problem by using a bank of items scaled
along the appropriate latent trait. To achieve a wide coverage of the
curriculum we used some 600 items in the first survey of mathematics at age
ten although no individual child was asked to attempt as many as 50.
From the 600, we were able to discard a proportion the next year because
they had been used as examples in our reporting, because they had
proved not to have very good measurement characteristics or because
they were judged no longer very relevant to what was being taught. (On
the second cycle, this last category was largely non-existent, but we
expect the number of items involved to grow as the years go by.) The
discarded items are replaced with new items developed by a panel of
practising teachers and curriculum experts, so that the next year's testing
has a substantial overlap with the preceding year, but it has been changed
sufficiently to bring it up to date. The new items, together with the
survivors from the previous year, are scaled back on to the original latent
trait scale, so that we feel we can make valid comparisons of standards
from one year to the next. Further, since we can keep a careful watch on
the performance of individual items over a number of years in order to
ensure that they are performing in a consistent way, we feel more
confident of our ability to make comparisons over a ten-year period, or even
longer if necessary.
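
One simple way of scaling a new calibration back on to an existing scale is to use the surviving items as a link. The sketch below assumes a plain mean-shift link, which is a common choice under the Rasch model but is not stated in the text to be the procedure actually used; the function name, item labels, and difficulty values are all hypothetical.

```python
def link_to_original_scale(original, new_calibration):
    """original: item -> difficulty on the original scale (surviving items).
    new_calibration: item -> difficulty from this year's own calibration,
    which has an arbitrary origin.
    Returns this year's items expressed on the original scale."""
    common = [item for item in new_calibration if item in original]
    if not common:
        raise ValueError("no surviving items to link the two calibrations")
    # Average displacement of the surviving items between the two scales.
    shift = sum(original[item] - new_calibration[item]
                for item in common) / len(common)
    return {item: value + shift for item, value in new_calibration.items()}

# Hypothetical difficulties in wits; item "C" is new this year.
original = {"A": 50.0, "B": 52.0}
this_year = {"A": 48.0, "B": 50.0, "C": 51.0}
print(link_to_original_scale(original, this_year))
```

Comparing each surviving item's shifted value with its original value also gives a direct check on whether individual items are behaving consistently from year to year.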

This brings us to the second problem I would like to consider, the
question of what to report. A simple statement to the effect that
standards have gone up or down by so many points since the year before is
unlikely to be of any help to anybody (except perhaps certain
politicians). The important findings are tied up in the ways teachers and
children are reacting to a changing curriculum, to the changing emphases
being placed on topics within that curriculum, and in the resulting
changed pattern of performance. Our aim in future APU work must be
to try to quantify the changes in the pattern of performance so that they
can be related to changes in curricular emphasis and, hopefully, to
changes in the country's perceived educational needs.

If we are to attempt this, then we are forced to confront the reality that
attainment in what we think of as a single school subject (be it
mathematics, science, or English language) is in reality a
multi-dimensional group of separate attainments. By this I do not mean to
imply that, for example, attainment in geometry is uncorrelated with
computational accuracy, understanding of algebra, or skill at using a
slide rule. There is ample evidence from past research that these traits are
quite highly correlated.

In an N-dimensional space there are two ways of looking at this
situation. The first is the bundle of traits all pointing in roughly the same
direction but each being assessed separately by its own set of items. The
second is to take this N-dimensional cigar and decompose it into
orthogonal axes. Here the major axis is a sort of conglomerate 'general
mathematics performance' and the minor axes represent not 'geometry'
but rather 'the way in which geometry is different from general
mathematics'. Of the two approaches to analysis and reporting, I prefer
the second for the following reasons.

If we choose to work with discrete sub-tests in geometry, computa- 
tional skills, etc, each with its own latent trait scaling, we can link perfor- 
mance from year to year within a sub-trait without problem. We know 
however that, in time, particular traits will be emphasized more in the 
way in which school mathematics is taught, and others will be empha- 
sized less. This may be reflected in increases in performance on certain 
traits and reductions on the others, but the link between the two will be 
hard to establish. Further it would be difficult to incorporate changes in
the assumed structure of school mathematics; for instance, we may be
required to combine two sub-traits into one, or break one up to make two
new ones. With the second approach to analysis, the unit is the individual
item within the item bank. Each item will have a loading (i.e. a difficulty
level) on the general mathematics scale, but also an indication of the
extent to which it measures one or more sub-traits such as geometry. In this
case, the entire collection of 'live' items at any point in time defines the
effective mathematics curriculum that is being assessed. Uneven patterns
of performance that result from the multi-dimensional nature of
mathematics achievement show up in departures from the Rasch model.
These departures (or residuals once the model estimates have been
subtracted from the data) can be analysed to provide the details of the
N-dimensional pattern of performance. Mead (1976), in the United
States, has developed some useful pointers here. The results of the testing
then, whether for an individual pupil or a large group of pupils, can be
expressed in terms of an overall level of performance in mathematics
together with a profile showing relative areas of strength and weakness.
If it is decided to redefine the trait structure either by combining existing
traits or by splitting to develop new ones, then this is readily
accomplished merely by re-classifying the items. Past data sets can then be
re-analysed in terms of the new structure in order to look for evidence of
change.
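
A profile of this kind can be sketched by averaging residuals within each sub-trait. The example below is a simplified illustration and not Mead's procedure: the function names, the item labels, and the classification of items into sub-traits are hypothetical, and abilities and difficulties are taken to be in natural-log units (logits).

```python
import math
from collections import defaultdict

def expected_score(ability, difficulty):
    """Rasch model probability of a correct response."""
    return 1 / (1 + math.exp(-(ability - difficulty)))

def subtrait_profile(ability, responses, difficulties, subtrait_of):
    """Mean residual (observed minus expected) for each sub-trait.
    Positive values indicate relative strength, negative relative weakness."""
    sums = defaultdict(float)
    counts = defaultdict(int)
    for item, observed in responses.items():
        residual = observed - expected_score(ability, difficulties[item])
        sums[subtrait_of[item]] += residual
        counts[subtrait_of[item]] += 1
    return {trait: sums[trait] / counts[trait] for trait in sums}

# Hypothetical pupil of ability 0 answering three items of difficulty 0.
responses = {"g1": 1, "g2": 1, "a1": 0}
difficulties = {"g1": 0.0, "g2": 0.0, "a1": 0.0}
subtrait_of = {"g1": "geometry", "g2": "geometry", "a1": "algebra"}
print(subtrait_profile(0.0, responses, difficulties, subtrait_of))
```

Because the classification enters only through the `subtrait_of` mapping, redefining the trait structure amounts to changing that mapping and re-running the same analysis on past data.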

In case all this seems rather abstruse, let me try and summarize here.
Over the years I expect mathematics performance to change. Not only
will its overall level move up or down when measured against the current
requirement being placed on the school system, but also the very
structure of the curriculum, and the pattern of emphases placed on various
component parts, will be continuously changing. Our N-dimensional
cigar will be slowly changing its shape in ways in which it would be
virtually impossible to predict. We want not only to be able to identify the
existence of these changes, and their general direction, but also to
quantify them in real performance terms. We want to say not only whether
performance in school mathematics is going up or down, but how the
definition of school mathematics as evidenced by the pattern of
performance of pupils is responding to the changing needs of society.

It is too early to say if we shall be able to achieve this. We are still in 
the process of establishing base lines from the initial surveys, but if you 
should feel inclined to invite me back in ten years time I may have
something more definite to report.

ITEM BANKING 
That I leave this topic to the end does not imply any lack of importance.
It is my belief that the future of educational testing lies largely with item
banks. They clearly will have a much wider application than just in na- 
tional monitoring programs, although such programs are currently pro- 
viding the substantial resources necessary for item bank development. I 
chose not, in this paper, to concentrate on the general issues surrounding 
item bank use, preferring to consider in some detail alternative measure- 
ment models, but most of the points I have discussed will have direct 
relevance to the item bank user. 

The great advantage of an item bank lies in its flexibility of operation. 
The simple and rapid construction of custom-tailored parallel tests to
order (long or short, hard or easy, wide or narrow, all with known
psychometric properties and with calibration tables generated
automatically) can improve the quality and impact of educational
testing everywhere.

As some of you know, I am involved in trying to set up an inter- 
national network of item banking centres which will exchange technical 
know-how and actual test materials between different nations. Why are
we trying to do this? Because to be really effective item banks need to be
large, and this means they will be expensive to create from the beginning.
Sharing materials can save money. Further, the existence of
internationally agreed criteria and conventions for classifying items, for
reporting psychometric parameters and so on will greatly assist
cross-national comparisons, evaluation, and accreditation.

On the smaller scale, my particular hope is that item banking can be
developed to provide the classroom teacher with really good diagnostic
instruments to clarify the learning difficulties of his or her pupils. In the
past this has tended to be a neglected area because of its inherent
difficulties, but latent trait scaling coupled with item banking may
provide the answer.

REACTANT STATEMENT

Glen A. Smith 



I welcome this opportunity to contribute to this invitational seminar, by
commenting on Dr Bruce Choppin's interesting and practical paper.
While this is a conference on measurement, it is directed to a particular
field, education and psychology, and as such it should touch on practical
aspects and examples. Dr Choppin's paper does this, while still being
soundly based on theory. We have been shown glimpses of latent trait
modelling in the field (or rather, classroom) and it is here I feel that the
model should face its closest checks. Its utility must go beyond keeping
the mathematically inclined among us happy and debating.

Dr Choppin raised the analogy of temperature, with its accepted
measurement properties, but let me take the analogy a little further. As
practical measurers, we are more concerned with fuzzier traits, like
'comfort'. Temperature is certainly one dimension of comfort: all people are
uncomfortable at very high and very low temperatures. If we were to use
a thermometer, with its good metrical properties, to measure comfort,
we would miss our mark, while getting data that fitted the Rasch model.
Some people are comfortable at 20°C, others at 26°C (it probably would
not be 'culture fair', in any case). We still need to be critically aware of
the name-trait distinction, as I am sure Dr Choppin is, but I see the point
overlooked so often that it is worth reiterating.

I am interested to hear that NFER is using a latent trait model to
measure achievement, an area where I think it may be less applicable
than others, for learning does presuppose exposure to the material
tested, and this can differ between groups, e.g. schools, and give data not
fitting the Rasch model while still truly measuring achievement. This
question can be treated empirically, and perhaps the data being collected
by NFER will show the assumptions made to be warranted. It should
give a test of the robustness of the model; if it uncritically fits any data,
with deletions of the occasional item, we need to think deeply about what
the model is giving us. This is possibly more relevant to think about for
diagnostic testing, an area that I think does not need Rasch modelling,
and by its nature (identifying low achievement areas for individuals)
possibly does not hold the essential assumptions.

Dr Choppin gave details of several other interesting applications and
extensions of the model which are in the exploratory stage, and I will be
interested to follow their developments, especially the analysis of
incomplete data matrices with its exciting application to multiple raters, a
common problem in assessment. I am also interested in hearing more
about the index. /;,, />, , especially its sensitivity to deviations from the 
model.

I would finally like to reinforce Dr Choppin's comments on item
banking with its close ties to computerization of testing, something of high
interest to me. I agree that the future of educational testing will be tied
closely to item banking, and I hope that it can work worldwide, drawing
internationally on the expertise of NFER and like bodies.





The Improvement of Measurement in Education and Psychology
Edited by Donald Spearritt
Copyright © ACER 1982



4 



The Linear Logistic Test Model and its 
Application in Educational Research 



One of the central questions of educational research with regard to test 
data is the assessment of learning effects. Psychometric analyses based on 
the Rasch model (Rasch, 1966) avoid some pitfalls of applying classical 
test theory (cf. Fischer, 1974; Rost and Spada, 1978). But this approach, 
which results in the measurement of the variable 'student ability' and its
change over time and also the variable 'item difficulty', is still deficient in
several ways. 

First, this approach gives no answer to the question of how differing 
problem difficulties can be explained from a cognitive psychological 
viewpoint. Second, it gives no analysis of how changes in individual 
ability are to be understood. In other words, a theory and method for 
analysing item difficulty and ability change in a psychological and educa- 
tional context is lacking. This paper demonstrates a way of overcoming 
these deficiencies. At the same time it should be emphasized that the 
main problem of applying the Rasch model and models based upon it is 
that they are at variance with the assumptions of several well-known 
psychological theories of learning and development. This will be 
demonstrated later in the paper. As yet, insufficient attention has been 
directed to these questions in English-speaking countries. 

In Austria and Germany in the early seventies, Rasch's ideas were
extended by developing logistic models to go beyond the quantification of
item difficulties and student abilities (Fischer, 1973; Scheiblechner, 1972; 
Spada and Scheiblechner, 1973). The intention was to represent explicitly 
the effects of thinking and learning processes by means of these models. 

In the following pages, one of these probabilistic test models will be 
described, the so-called Linear Logistic Test Model (LLTM). The LLTM 
was first discussed by Scheiblechner (1972). Cox (1970) proposed a
similar model but without the explicit notion of interindividual
differences. Fischer (1973; 1974; 1976) studied the statistical properties of 
the LLTM and derived the estimation equations on which the com- 
puterized algorithm (cf. Fischer, 1974) is based; this was also used in our 
studies. Fischer (1978) gives a thorough review of the work done in this 
area. 

THE LINEAR LOGISTIC TEST MODEL (LLTM) 
The basic idea which led to this model is the following: 

1 The difficulty of the items of a test, e.g. an achievement test, is 
traced back to ('explained' by) the number and difficulty of the cognitive
operations needed and used for their solution; and/or 

2 the effect of the conditions of teaching and learning (e.g. instruc- 
tional measures) with which the subjects were faced before taking the test 
(and possibly also of those conditions which were of relevance while the 
test was taken) are quantitatively assessed. 

The LLTM makes it possible to realize this idea by decomposing the
item parameters into linear combinations of more elementary parameters
corresponding to the difficulty of cognitive operations or to the effect of
instructional measures, etc. The estimation of the elementary parameters 
is based on the same principles as the estimation of the item parameters 
in the Rasch model, and the validity of the decomposition can be tested 
statistically by methods similar to those used for testing the fit of the 
Rasch model (cf. Andersen, 1973). 

The LLTM is a Rasch model with an additional marginal condition. 
Therefore the model is characterized by the following equation: 

$$p(+ \mid v, i) = \frac{\exp(\xi_v - \sigma_i)}{1 + \exp(\xi_v - \sigma_i)}$$

with

$$\sigma_i = \sum_j f_{ij}\,\eta_j + c$$

The probability that student v solves item i correctly is represented,
according to the Rasch model, as a logistic function of two parameters,
namely ξ_v, characterizing the ability of the student to solve problems of
this kind, and σ_i, characterizing the difficulty of the item. But what is
denoted by f_ij and η_j? In the context of analysing the problem-solving
process, the item parameter σ_i is seen as a linear function of the number
and difficulty of the cognitive operations leading to a correct solution.
Therefore, in this case, f_ij denotes the hypothetical frequency with which
operation j is needed. The parameter η_j characterizes the difficulty of
operation j and c is a normalizing constant. In a later part of this paper
we shall see that in studies in educational evaluation the parameters η_j are
often introduced to quantify the effect of instructional measures on the
item difficulties.

Since the LLTM is a Rasch model with a linear marginal condition, it
shares many of its characteristic features with this model. The number of
correct responses is a sufficient statistic for the ability parameter, just as
in the Rasch model. The structural parameters $\eta_j$ are estimated by a
set of conditional maximum likelihood equations, which do not include the
ability parameters. The estimates of the parameters $\eta_j$ are therefore
'sample free', in the same sense as the item parameter estimates in the
Rasch model. A precondition for an application of the LLTM is that the
matrix $(f_{ij})$, which is a k (number of items) times m (number of
elementary parameters) matrix, is (a) of rank m and (b) specified before
the estimation procedure.
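The decomposition and the rank precondition can be illustrated with a
small sketch. It is not the estimation algorithm referred to above; it only
evaluates the model equation for given parameters. The matrix F, the
operation difficulties eta, and all function names are hypothetical values
chosen for illustration.

```python
import math
from fractions import Fraction


def item_difficulties(F, eta, c=0.0):
    """Return sigma_i = sum_j f_ij * eta_j + c for every item (row) of F."""
    return [sum(f * e for f, e in zip(row, eta)) + c for row in F]


def solve_probability(xi, sigma):
    """Rasch probability exp(xi - sigma) / (1 + exp(xi - sigma))."""
    return 1.0 / (1.0 + math.exp(-(xi - sigma)))


def rank(matrix):
    """Rank of a matrix by exact Gaussian elimination over the rationals."""
    rows = [[Fraction(x) for x in row] for row in matrix]
    n_cols = len(rows[0]) if rows else 0
    r = 0  # number of pivot rows found so far
    for col in range(n_cols):
        pivot = next((i for i in range(r, len(rows)) if rows[i][col] != 0), None)
        if pivot is None:
            continue
        rows[r], rows[pivot] = rows[pivot], rows[r]
        for i in range(len(rows)):
            if i != r and rows[i][col] != 0:
                factor = rows[i][col] / rows[r][col]
                rows[i] = [a - factor * b for a, b in zip(rows[i], rows[r])]
        r += 1
    return r


# Hypothetical example: k = 4 items, m = 2 elementary parameters.
F = [[1, 0], [0, 1], [1, 1], [2, 1]]
eta = [0.5, 1.0]  # illustrative operation difficulties, not estimates
sigma = item_difficulties(F, eta)

# Precondition (a): F must have rank m.
assert rank(F) == len(eta)
```

Exact rational arithmetic is used for the rank check so that the result
does not depend on a floating-point tolerance; for a 0/1 matrix of the
size discussed here this is cheap.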

To provide detailed information about the advantages and the problems
of applying the LLTM, three different empirical studies which were
carried out at the Institute of Science Education at Kiel (West Germany)
are summarized. In the first study the LLTM was used as a model of
thinking and intellectual development in the area of balance scale tasks
(Spada, 1976; Spada and Kluwe, 1980). In the second study the effect of
different instructional measures was estimated in connection with an
instructional unit on nuclear power plants (Spada, Hoffmann, and
Lucht-Wraage, 1977). In the third study the LLTM was used to develop an
instructional unit on problems of 'recognizing functional relationships'
and to assess its effects (Haussler, 1978).

The discussion of these investigations will show that there are prob- 
lems in the assumptions of the Rasch model itself. 

THE LLTM AS A MODEL OF THINKING AND
INTELLECTUAL DEVELOPMENT

Structural Assumptions

In the first study to be reported here, the development of the concept of
proportion was investigated by means of the LLTM and a deterministic
model of qualitative change (Spada, 1976; Spada and Kluwe, 1980). Only
the results of balance scale problems analysed by means of the LLTM
will be discussed. These problems represent one form of proportional
tasks. They have been frequently used by developmental psychologists
since Inhelder and Piaget (1958).

By studying the relevant Piagetian literature and by observing children
who solved balance scale tasks, hypotheses were deduced about the
cognitive operations applied by children in reaching the correct solution,
and a sample of balance scale tasks with specified task structures was
constructed. The term 'psychological structure' of a task denotes in this
context the type and number of cognitive operations which enable a
person from a certain population to solve a task.

Altogether a set of eight cognitive operations for solving balance scale
items was defined. Table 1 shows four of these operations.

Table 1 Four (of Eight) Cognitive Operations Assumed to be Relevant
for the Solution of the Balance Scale Tasks Used in the
Investigation

Operation

1 Attention to and deduction from different amounts of weights

2 Attention to and deduction from different lengths of the lever arms

3 Compensation of a change of the amount of one weight or the length of one
lever arm in the same modality on the other side of the bar

4 Compensation of a change in the other modality on the other side of the bar

It was attempted to present tasks whose solutions involved certain
subsets of the eight postulated cognitive operations. Twenty-four tasks
corresponding to different combinations of these operations were
constructed. Figure 1 shows one of these tasks. It is supposed that
operations 1 and 4 are relevant in the solution of this task. It was
hypothesized that the student was thinking in the following way: because
the weight on the left side is reduced, the bar will be unbalanced. To
compensate for this change on the left side, the weight on the right side
has to be moved inwards. Analogously, other tasks were related to other
combinations of operations.

For every item i, a vector of the task structure $f_i$ was defined,
consisting of ones and zeros, where a one at the j-th position denotes the
presence of operation j in item i and a zero denotes its absence.
The structure of the sample item referred to in Figure 1 is
(1, 0, 0, 1, 0, 0, 0, 0). The hypothetical structure of all items under study
can be summarized in a task structure matrix $(f_{ij})$.
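The coding of task structures can be sketched as follows. The helper name
and the two additional rows of the matrix are hypothetical; only the vector
for the item of Figure 1 (operations 1 and 4 out of eight) is taken from
the text.

```python
def task_structure_vector(operations, n_operations=8):
    """Return a 0/1 tuple with a one at position j (1-based) for each operation j used."""
    return tuple(1 if j in operations else 0 for j in range(1, n_operations + 1))


# The sample item of Figure 1 is assumed to require operations 1 and 4.
figure1_item = task_structure_vector({1, 4})

# A task structure matrix stacks one such vector per item. The two extra
# rows are purely illustrative and are not taken from the study.
task_structure_matrix = [
    figure1_item,
    task_structure_vector({2}),
    task_structure_vector({2, 3}),
]
```

Each row of the matrix is one item and each column one operation, so the
full matrix for the study would have 24 rows and 8 columns.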

Quantitative Developmental Assumptions

In terms of the LLTM, the development of the ability to solve
proportional tasks can be considered to take the following form.

Developmental change is reflected in the LLTM as a purely quantitative
change in the student parameters $\xi_v$. Development then is comparable
to global learning. This type of learning leads to higher solution
probabilities for all tasks of one homogeneous class of problems. For all
children and adolescents under study (with regard to the investigation on
balance scale tasks, the age range between 11 and 16 years) it is supposed,
therefore, that the task structure remains constant and that the operation
parameters are invariant. That is, correct solutions are assumed to be
based on the same cognitive operations irrespective of age; the operations
are assumed to have a constant rank order with regard to difficulty.

[Figure 1: two drawings of a balance scale, each with its sides labelled
'left' and 'right', accompanied by the following task text.]

The drawing shows a balance scale with weights. The weights are hung in
such a way that the scale is in equilibrium.

Now the weight on the left-hand side of the bar is decreased. In order to
keep the bar in equilibrium, the weight on the right-hand side of the bar
must:
stay in the same position. □
be hung further inward. □
be hung further outward. □
I do not know. □

Figure 1 A Balance Scale Problem Used in the First Investigation

It is hypothesized that Operation 1 and Operation 4 are applied to obtain
a correct answer. (Adapted from Spada and Kluwe, 1980, Figure 1.1.)

Figure 2 presents the functional relationship that is expected on the
basis of the model equation between task solution probabilities and
children's task solution ability, the latter corresponding to the
developmental level. Given medium person ability, the probability of a
correct solution of the structurally most complex of the three tasks is
very small. With higher person abilities, the solution probabilities of the
three tasks approach each other and become approximately one for very
high values. The functional relationship can be understood to represent a
more precise version of a model proposed by Flavell (1971) to describe
the developmental change of intellectual abilities.
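The shape of this relationship can be checked numerically. In the sketch
below the three task structure vectors are those read from the caption of
Figure 2, whereas the operation difficulties eta are invented for
illustration (the estimates from the study are not reproduced in this
paper) and the normalizing constant c is set to zero.

```python
import math


def p_correct(xi, sigma):
    """Model probability exp(xi - sigma) / (1 + exp(xi - sigma))."""
    return 1.0 / (1.0 + math.exp(-(xi - sigma)))


def logit(p):
    """Inverse of the logistic function."""
    return math.log(p / (1.0 - p))


# Hypothetical difficulties of the eight operations (assumption).
eta = [0.0, 0.5, 1.0, 0.0, 0.0, 1.5, 0.0, 0.0]

# Task structure vectors as read from the Figure 2 caption.
tasks = [
    (0, 1, 0, 0, 0, 0, 0, 0),
    (0, 1, 1, 0, 0, 0, 0, 0),
    (0, 1, 1, 0, 0, 1, 0, 0),
]
sigmas = [sum(f * e for f, e in zip(task, eta)) for task in tasks]

# Solution probabilities of the three tasks at a low, medium and high ability.
for xi in (0.0, 2.0, 8.0):
    print(xi, [round(p_correct(xi, s), 3) for s in sigmas])
```

On the logit scale the curves differ only by the constant difference of
their item difficulties, which is the sense in which the item
characteristic curves are parallel; on the probability scale they all
approach one as ability increases.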

Figure 2 The Functional Relationship Postulated in the LLTM between
Task Solution Probability and Person Ability (or 'Developmental Level')

[Plot of item characteristic curves; abscissa: ability, corresponding
with developmental level.]

The item characteristic curves of three tasks with the following task
structure vectors f = (01000000), f = (01100000) and f = (01100100) are
shown. The abscissa distances between the three item characteristic
curves, which are parallel to one another, were computed from the data
and reflect the differing task difficulties. (Reprinted from Spada and
Kluwe, 1980, Figure 1.4)

Empirical Findings

In the investigation reported here, a pencil and paper test was used
(Spada, 1976). Twenty-four balance scale tasks were given. The sample
included 949 male and female students from ages 11 to 16 attending
German secondary schools. The test was administered in classrooms. (In
another investigation real balance scales were used in individual sessions
(Spada and Kluwe, 1980).)

Conditional likelihood ratio tests were used to test the assumption of
sample free parameter estimates and thus the validity of the Rasch part
of the LLTM. Some of the tests showed significant divergences between
the estimates of the item parameters computed from the data derived
from different groups of students. It must be emphasized, however, that
the parameter estimates did not differ widely and that the statistical tests
indicated significant divergences essentially because of the large sample
of subjects (significant results would not have been obtained with fewer
than approximately 670 subjects). There is the question, nevertheless, of
what shortcomings might be responsible for these results of testing the
Rasch part of the LLTM.

In the next step, the operation difficulty parameters $\eta_j$ were
estimated. Based on these parameter estimates and the task structure
hypotheses $f_{ij}$, estimates of the item parameters $\sigma_i$ were
computed and compared with those item parameter estimates resulting from
an application of the parameter estimation algorithm of the Rasch model
itself. A graphical comparison indicated a good correspondence between
the two different sets of parameter estimates. However, a conditional
likelihood ratio test of the linear marginal condition of the LLTM showed
that the differences were significant. This meant that at least some of the
task structure hypotheses were not valid and/or that the formalization of
the hypotheses by means of the linear logistic model was not without
problems.

Quite another approach to testing the validity of LLTM results on
operation difficulties was taken by Nahrer (1977; cf. also Fischer, 1978).
He constructed new items for some of the tests which had been analysed
by means of the LLTM in various research studies. Nahrer predicted the
difficulty of the new items by using the published results on the estimates
of the operation difficulties and the task structure hypotheses. The new
items were then given to a new sample of subjects and the item difficulty
parameters were estimated from these data by means of the Rasch
model. Nahrer reports a good correspondence between both groups of
parameter estimates in most of the cases, especially for tasks from the
field of mechanics, e.g. rotation mechanism problems (Spada, 1977). It
was possible to predict the difficulty of the newly constructed items quite
well, although in this case also the fit of the LLTM was far from perfect.
Unfortunately no reanalysis was carried through for the balance scale
problems discussed in this paper.

Nahrer's study indicates another interesting field for applying the
LLTM, namely the controlled construction of items with predictable
difficulty, e.g. for item banks and individualized testing (cf. Fischer and
Pendl, 1980). The LLTM based on valid task structure hypotheses allows
us to define in a precise way what might be understood by the notion of a
'domain of tasks'. It is the homogeneous class of items which can be
constructed on the basis of the set of analysed operations.

Shortcomings of the LLTM and the Rasch Model

There are, of course, aspects of the task structure hypotheses which are
open to question. They do not encompass all features of the problem
solving process. Nothing is stated about sequential or temporal
characteristics. Encoding, decoding, and memory features are virtually
neglected in their present state of development. But the different tests of
fit of the LLTM have shown that it is necessary to discuss some of the
assumptions of the structural part (the marginal condition) and the
Rasch part of the LLTM in more detail.

There is a serious drawback to the LLTM when applied in this context. 
The task solution probabilities cannot be understood as the products of 
the corresponding operation probabilities, because in general 



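$$\frac{\exp\bigl(\xi_v - (\sum_j f_{ij}\,\eta_j + c)\bigr)}{1 + \exp\bigl(\xi_v - (\sum_j f_{ij}\,\eta_j + c)\bigr)} \neq \prod_j \left[\frac{\exp(\xi_v - \eta_j)}{1 + \exp(\xi_v - \eta_j)}\right]^{f_{ij}} \qquad (2)$$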



This contradicts the 'product rule', which is a familiar assumption in
probabilistic automata theory, as proposed by Suppes (1969) and ap- 
plied, for example, in the analysis of arithmetic teaching in elementary 
school, by Suppes and Morningstar (1972). The product rule states that 
the probability that a student solves an item correctly is the product of 
the probabilities that he carries out correctly all operations necessary for 
the solution (for a thorough discussion, see Spada, 1977). 

Equation (2) makes it clear that this problem is not specific to the
LLTM but also occurs in the Rasch model itself if that model is applied,
as is usual, at the level of item data and not at the level of operation
data. It may be that some of the negative results in applying the
Rasch model in this study and in other investigations are due to the fact
that the task solution probabilities cannot be understood as the products
of the probabilities of a correct application of the corresponding
operations. Furthermore the assumption of a linear combination of the
operation parameters in the logistic function is rather arbitrary in the
sense that it does not reflect psychological hypotheses about underlying
cognitive processes. Statistical reasons were decisive in the choice of this
type of function, since only the logistic function in the framework of the
Rasch model involves the important advantage of 'sample-free'
measurement.
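A small numerical case makes the discrepancy between the two formulations
concrete. The sketch assumes a hypothetical item that requires two
operations once each, both of difficulty zero, a student of ability zero,
and a normalizing constant c of zero; the function names are chosen for
illustration.

```python
import math


def logistic(x):
    """exp(x) / (1 + exp(x))."""
    return 1.0 / (1.0 + math.exp(-x))


def lltm_probability(xi, f_row, eta, c=0.0):
    """Item-level probability with sigma_i = sum_j f_ij * eta_j + c."""
    sigma = sum(f * e for f, e in zip(f_row, eta)) + c
    return logistic(xi - sigma)


def product_rule_probability(xi, f_row, eta):
    """Product over operations of the operation-level probabilities."""
    p = 1.0
    for f, e in zip(f_row, eta):
        p *= logistic(xi - e) ** f
    return p


xi = 0.0          # hypothetical student ability
f_row = (1, 1)    # both operations needed once
eta = (0.0, 0.0)  # hypothetical operation difficulties

p_lltm = lltm_probability(xi, f_row, eta)             # logistic(0) = 0.5
p_product = product_rule_probability(xi, f_row, eta)  # 0.5 * 0.5 = 0.25
```

With these values the linear combination inside the logistic function
gives 0.5, whereas the product of the two operation probabilities gives
0.25, so the two sides of the inequality differ.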

Another criticism of the LLTM can be made from the viewpoint of 
developmental psychology. The assumptions that the task structure is the 
same for all children in the sample and that developmental change can be
represented as a quantitative change of the person parameters are at 
variance with numerous developmental findings. Again this criticism ap- 
plies equally to the LLTM and the Rasch model, if the models are used to 
analyse data from children of differing age levels. 

The balance scale tasks of this investigation are in some respects 
similar to those analysed in the experiments of Piaget and his co-workers 
(Inhelder and Piaget, 1958). In addition, their results inspired the for- 
mulation of the task hypotheses which were tested in this study. While 
Piaget's theory assumes that the solution algorithms vary and become
more and more complex in the course of development, applications of
the Rasch model and usually also of the LLTM are based on the
assumption that correct solutions result from the same cognitive
operations for all individuals tested.

In Piaget's theory, it is emphasized that qualitative structural changes
take place in intellectual development. In applications of the LLTM,
inter- and intra-individual differences are often understood as differences
in the degree of mastery of the same solution algorithm, in contrast to
developmental theories, where these differences are explained in terms of
different solution algorithms.

In the field of information processing theories, the work of Siegler
(1976) is relevant to the present discussion. In his rule assessment
approach, cognitive development is also characterized as the acquisition
of increasingly powerful rules for solving problems. In his extensive study
of problem solving with balance scale tasks, Siegler (1976) postulated
four different algorithms, each algorithm corresponding to one
developmental stage. His theory predicts for every stage a certain pattern
of correct and false answers. These answer patterns are used to assess the
developmental level (with regard to this class of tasks) of each child. One
of the more interesting findings which substantiates Siegler's assumptions
is that children moving from stage II to stage III show a striking decrease
in the number of correct answers with regard to one class of items, the
so-called conflict-weight items. It is assumed that at this developmental
level the children have learned to pay attention at the same time to both
dimensions of balance scale tasks, namely weight and distance, but do
not yet know exactly how these variables are related. The consideration
of both dimensions without exactly knowing how to combine them leads
to an increase in the number of incorrect answers, because the sub-class
of conflict-weight items was constructed in such a way that the items can
be answered correctly (without full insight into the problem) by reference
to weight alone.

The Rasch model does not fit data of this type, nor does the LLTM, if
only one matrix of task structure hypotheses is assumed for all subjects
of the sample studied (cf. May, 1979). In principle, it would be possible
to represent such structural changes (e.g. the acquisition of new
sequences of operations) in the LLTM. This could be done by specifying
different task structure hypotheses for different children and for the same
child at different stages of development or learning. A prerequisite would
be to have well-founded hypotheses about such structural changes with
regard to each subject.

Empirical falsification of the Rasch model and the LLTM, which
would be inevitable with this type of data, could also be avoided by
excluding all tasks from the test sample whose difficulty does not decrease
in a monotonic manner with age. This approach seems defensible from a
diagnostic viewpoint. It is more problematic if the main interest is in a
cognitive analysis of the developmental process.

In summary, it can be said that structural changes of the problem solv- 
ing process caused by development or learning contradict the homo- 
geneity assumption of the Rasch model and the LLTM (with only one 
task structure matrix). If inter- or intra-individual differences result from 
such structural changes in a sample of subjects, deviations will be 
detected in graphical and statistical tests of the model.

We refer finally to two groups of psychological models of human
knowledge, and of its acquisition, storage, and use. These relate to 
models based on semantic networks (Dorner, 1976; Norman and 
Rumelhart, 1975) and on production systems (Newell and Simon, 1972) 
(cf. also Anderson, 1976; Greeno, 1978). In these models cognitive pro- 
cesses are represented in such a way that the assessment of structural 
changes is also of special importance in the measurement of change. 

As a consequence we have to face the fact that the great majority of 
psychological developmental and learning theories postulate cognitive 
changes which would make the emergence of homogeneous item samples 
an exceptional, surprising result. In reality, however, the multiplicity and 
complexity of factors influencing test behaviour often lead to a
falsification of these theories and to a reasonable fit of the probabilistic Rasch
model under appropriate item construction conditions. 

EDUCATIONAL EVALUATION: EXPERIMENTAL
DESIGNS WITH BINARY DATA

In educational evaluation some parts of the learning history of each
individual are usually known. The central aim of such investigations is
often the assessment of the effects of different teaching strategies on the
learning outcome. The most relevant type of learning effects, which can
be assessed by means of the LLTM and traced back to instructional
factors, are global learning effects. Global learning leading to higher
solution probabilities for all tasks of one homogeneous class can be
represented in the model either as an increase in the value of the
person-parameters (i.e. individual abilities to solve tasks of this type
correctly) or as a general decrease in the values of the item-parameters
(i.e. item difficulties). For technical reasons we shall use the second form
of representation of global learning effects.

Let us consider the following very simple experimental educational
design. An instructional unit is applied in two different variants in two
samples of students. A third sample (control sample) does not receive
this type of instruction. Each sample comprises about four classes. Of
interest is the effect of each of the instructional methods on the
improvement of some ability or skill of the students referred to in one of
the learning objectives. This ability or skill is assessed by means of samples
of items of one homogeneous item class given before instruction
(pre-test) and after instruction (post-test). (For a detailed discussion of
evaluation problems of this type, see Rost and Spada, 1978.)

Equation (3) shows, for the general case of t tests (test 1 = pre-test),
how the LLTM might be applied in problems of this type. The linear
marginal condition is introduced to quantify the effects of the
instructional methods under study. The difficulty of the items of test t
(e.g. the post-test) is traced back to the difficulty of the items of the
pre-test (before instruction), to the effects of the instructional methods
and to a trend parameter, characterizing non-instructional general effects
between the pre-test and test t on the ability under study. Table 2
illustrates for our simple example how the corresponding matrix f of the
LLTM is set up.

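$$p(+ \mid v, i, t) = \frac{\exp(\xi_v - \sigma_{vit})}{1 + \exp(\xi_v - \sigma_{vit})} \qquad (3)$$

with $\sigma_{vit} = \sigma_i - \delta_{vt}$ and $\delta_{vt} = \sum_a f_{vat}\,\eta_a + \tau_t$.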

$p(+ \mid v, i, t)$ is the probability that student v solves task i
correctly at test (time) t.

$\xi_v$ is the ability parameter of student v.

$\sigma_{vit}$ is the difficulty of item i at time t (test t) after that
type and amount of teaching (i.e. instructional method in our example)
that took place in the class with student v between test 1 and test t.

$\sigma_i$ is the (hypothetical) difficulty of item i before instruction
(test 1).

$\delta_{vt}$ characterizes the effect of that type and amount of teaching
(i.e. instructional method in our example) that took place in the class
with student v between test 1 and test t (with $\delta_{v1} = 0$; t = 1 is
the pre-test).

$\eta_a$ characterizes the effect of instructional variant a on the
ability.

$f_{vat}$ denotes the amount of teaching method a between test 1 and
test t (in our example, 1 denotes that the method was used with student v,
0 that it was not used).

$\tau_t$ characterizes non-instructional general effects between test 1
and test t on the ability, and is a trend parameter.
In the example,

$\delta_{vt} = \eta_1 + \tau_t$ in the case of instructional variant 1, and
$\delta_{vt} = \eta_2 + \tau_t$ in the case of instructional variant 2, and
$\delta_{vt} = \tau_t$ with no instruction.
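The three cases of the example can be written out as a small sketch. The
coding of the groups follows the example (1 if a variant was used, 0 if
not); the numerical values of the effect and trend parameters are invented
for illustration. The sketch also assumes the sign convention that a
positive effect lowers the item difficulty, in line with the
representation of global learning as a decrease in the item parameters.

```python
import math


def instruction_effect(f_row, eta, tau):
    """delta = sum_a f_a * eta_a + tau for one group of students."""
    return sum(f * e for f, e in zip(f_row, eta)) + tau


def post_test_probability(xi, sigma_pre, delta):
    """Probability of a correct answer after instruction.

    Assumption: a positive delta lowers the difficulty, i.e. the
    post-test difficulty is sigma_pre - delta.
    """
    return 1.0 / (1.0 + math.exp(-(xi - (sigma_pre - delta))))


# Coding of the three groups: (variant 1 used, variant 2 used).
design = {
    "variant 1": (1, 0),
    "variant 2": (0, 1),
    "control": (0, 0),
}

eta = (0.8, 0.5)  # hypothetical effects of the two variants
tau = 0.2         # hypothetical trend parameter

effects = {group: instruction_effect(f_row, eta, tau)
           for group, f_row in design.items()}
```

Under these made-up values the control group changes only by the trend,
while each instructed group changes by the trend plus the effect of its
variant, which mirrors the three lines of the example.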

Based on this approach, it is possible to compute 'sample free'
estimates of the effect parameters of the instructional methods and of the
trend parameter, even in the absence of random sampling with
approximately equal ability distributions in the different sub-samples of
students. This property is of great importance in our example and in
general in educational evaluation, because instructional methods are
usually tested with classroom groups, that is, the analysis is based on
cluster sampling. Cluster sampling leads to an underestimation of error
variance in analyses of variance and thus to an overestimation of the
statistical significance of the instructional effects. Using the LLTM, the
significance of the effects and of differences between them can be tested
statistically by means of conditional likelihood ratio tests. These tests do
not rely upon the variance of the ability parameters; they are conditional
tests, in which the ability parameters do not even enter.

The use of the LLTM in this context has the additional advantage
(common to most of the Rasch model applications) that it is not
necessary to give the same test at the different points in time. Provided
that some items are given repeatedly, the other test items can be selected
in such a way that their difficulty is adapted to the achievement level of
the students at the time the test is given. If one is not interested in the
general trend effect, but only in the question of the differences of the
effects of the instructional methods, it is even possible to present different
item samples in the different tests, as long as all items measure the same
ability, thus meeting the assumption of homogeneity.

This approach was also used by Spada, Hoffmann, and Lucht-Wraage
(1977) to evaluate the effects of an instructional unit and of four
additional instructional measures. The instructional unit was entitled
'Nuclear Power Plants - Dream or Nightmare?'. A Rasch-scaled situation
test was developed (cf. Spada and Lucht-Wraage, 1980), which
made it possible to assess several variables simultaneously, which
corresponded with the objectives of the instructional unit. The four groups
of instructional measures were: introduction of a model person,
activation of alarm in subjects, the stabilization of attitudes, and group
instruction on the basis of interaction structure analyses.

The usual procedure in empirical curriculum development is to lump
all good ideas together and to measure their joint effect by evaluating the
resulting instructional unit. In this particular classroom experiment, a
different approach was chosen. With the traditional ideal of a factorial
design in mind, the four measures were combined. Each of the 16
combinations (each measure either present or absent) was then applied in
at least one class.

The LLTM was used to analyse the data for all students in the 22
classes involved in the experiment and to estimate the effectiveness of
each instructional measure, the general effects of the instructional unit, 
and the item difficulties and student abilities for each of the variables of 
the situation test. The results of the study cannot be presented here 
because of space limitations, but are available elsewhere (Spada, 
Hoffmann, and Lucht-Wraage, 1977). 

Mention must be made of two serious drawbacks in applying the 
LLTM in this way in the context of educational evaluation.

1 In contrast with the application of multivariate analysis of variance, 
no simultaneous analysis of all dependent variables is possible. 

2 It has to be assumed that the learning effects are not person-specific 
but depend only on type and amount of teaching. In other words, the 
model is only valid if all inter-individual differences in learning outcome
which are not attributable to ability prior to instruction can be traced
back to global effects of the individual instruction or learning histories of
the students in the course of the educational experiment. This
assumption of global learning, which refers to the level of the parameters
employed and not directly to the reactions of students, is quite restrictive 
and should be tested when the LLTM is applied in this manner. 

Fischer (1977; 1978) referred to another problem in connection with 
the use of the LLTM and the Rasch model, namely, the restrictive
assumption of item homogeneity, and developed similar logistic models,
the Linear Logistic Models with Relaxed Assumptions (LLRA), which
are not based on this assumption.



A study undertaken by Haussler will serve as a final example of an
application of the LLTM in an educational context.

Haussler (1977; 1978; 1981) developed and evaluated two different
teaching programs to improve the ability of adolescents to solve tasks of
the type 'recognizing functional relationships'. He made use of both the
structural aspect of basing teaching on task structure hypotheses and the
assessment aspect. Some of the different ways of applying the LLTM, as
discussed in the preceding pages, were therefore combined in his
investigation.

Haussler's investigation was based on 356 students in a pilot study and 
10.17 students in the main study, aged 12 to 16. He used the LLTM firstly 
to describe by means of task structure hypotheses the constituents of the
solution algorithms used by the students to solve the tasks, and secondly
to measure the effects of the teaching programs.

The task structure hypotheses were deduced by observing and
interviewing students solving such problems and by considering some of the
conceptions of Scandura (1973). The hypotheses were then tested by
means of the LLTM. In line with the preceding discussion of
shortcomings of the LLTM as a model of thinking, it is not surprising that
the fit of the LLTM (and of the Rasch model) again proved to be not
satisfactory in some cases. On the other hand the contribution of the task
structure analyses in producing a basis for psychologically well-founded
teaching methods was substantial.

Those problem solving operations which were used by students to
solve individual problems correctly, before any training in this special
field was provided, were denoted by Haussler as 'spontaneous'
algorithms. The first teaching program was based on these spontaneous
algorithms. All of the algorithms identified by the task structure analyses
have one procedure in common: they involve manipulation of the data in
such a way that an invariant quantity is produced. This common
procedure was used to synthesize algorithms which include many of the
spontaneous algorithms as special cases (cf. Haussler, 1978). As part of
the second teaching program these 'synthetic' algorithms were taught;
they are more comprehensive, theoretically superior, higher-order
algorithms.

Haussler (1978; 1981) ascertained that both programs yielded
statistically significant, substantial and relatively long-lasting positive
effects. Figure 3 summarizes the results of the estimation of the effects of
the two teaching programs. The LLTM was used to estimate 2 (teaching
programs) × 4 (points in time of testing) × 3 (subsamples of tasks) = 24
instruction effect parameters $\delta$ (cf. Equation 3).

The students were tested prior to instruction (T1), immediately after
instruction (T2), and six weeks after instruction (T3) in either teaching
program A (spontaneous algorithms) or teaching program B (synthetic
algorithms). Some students were given a short refresher program after
the six week period; these are designated (T4) in place of (T3). Three
groups of tasks were given in the four testing phases. Group 1 tasks were
used during instruction to practise the different algorithms. Group 2
tasks could be solved by an algorithm similar to the one learned during
instruction. Group 3 tasks could be solved only by inventing a new
algorithm. A most interesting result was that the teaching of the
spontaneous algorithms turned out to be more effective with Group 3 tasks.
Presumably, this strategy had forced the students at the outset to
consider the possibility of being confronted with new problems and to
develop solution algorithms on their own, along the lines of the learned
spontaneous algorithms, in order to cope with these problems. The
presence of interactions between teaching effects and certain subsamples
of tasks draws attention to the restrictiveness of the usual assumption in
applying the LLTM and the Rasch model, namely that learning effects
are postulated to be constant for all items.

Figure 3 Graphical Representation of Instruction Effects $\delta$

The solid line corresponds to teaching program A (spontaneous
algorithms), the broken line to teaching program B
(synthetic algorithms). Significant differences between A and
B are marked with a dot. (Reprinted from Haussler, 1978,
Figure 6.)

SOME CONCLUDING REMARKS 

After twenty years of development and discussion the logistic models 
originating in the basic ideas of Rasch are largely accepted as valuable 
tools in educational measurement. In recent years, however, doubts have 
been expressed by several authors about the validity of these models as 
psychological models of cognitive processes in learning and develop- 
ment. In this paper we have tried to consider questions of educational 
measurement and of psychological theorizing simultaneously. We have
demonstrated that the LLTM, as one of these logistic models, and the
Rasch model itself cannot provide a completely acceptable basis for
educational measurement if the various critical psychological arguments
are taken seriously.

Nevertheless the LLTM seems to be an interesting tool in cognitive
research and educational evaluation, because it makes it possible both to
measure inter- and intra-individual differences and at the same time to
analyse general regularities which are often hidden behind these
differences. In practice, this statement holds only if the restrictive
assumptions of the model are not falsified by the data under study. As a
consequence, the LLTM and the Rasch model should be applied only in
those cases in which the validity of their assumptions is plausible and is
tested sufficiently. It is our hope that this paper has added some insight to
a more restricted and better controlled use of the LLTM and the Rasch
model.

REACTANT STATEMENT

Glenn Rowley



There are two things on which I want to comment. One is the title of the
paper, and the other is the paper itself. The title I have had for some time
and the paper for very little, so I may give more considered comments on
the title of the paper than on its content. Another point I would like to
note is that there has been an unannounced change in the title. The
program leads us to expect a paper on 'The Linear Logistic Test Model and
its Application to Educational Evaluation'. The paper we have before us
has 'evaluation' replaced in its title by 'research'. I am not sure whether
the alteration was inadvertent or resulted from a change of plans, but it
does change one's expectations quite dramatically. While it seems to me
that the use of latent trait models in evaluation is an area that, if it proves
to be successful, is going to be very important for evaluators, it is also an
area in which there are many problems. So I want to begin by making a
couple of points about evaluation, and about how it differs from
research.

Firstly, as distinct from research, evaluation is an activity which is, or I
would argue should be, conducted by or in conjunction with teachers,
for the benefit of teachers, and ultimately for pupils. I am not sure that
that is always the case with research, and I am not even sure that it
should be. I am quite sure that evaluation is not always conducted in that
way, and perhaps this has some implication for the kinds of measurement
that we can make use of in evaluation. Barry McGaw introduced
the analogy of the sailing ship, and I think it can be taken much further.
It does seem to me that we have to live with the fact that what we are
doing in faculties and schools of education throughout this country is
sending our teachers out, not in battleships or even in sailing ships, but in
row boats without oars. We rarely provide our teachers with enough
training to cope adequately with the demands made on them by
traditional norm-referenced measurement procedures. Our teachers usually
know little about criterion-referenced measurement (except, frequently,
that they are in favour of it), and I seriously wonder whether measurement
based on latent trait models can yield results which are meaningful
to the practitioner. Yet, if measurement is to have an impact on practice,
it must yield results which are meaningful to practitioners, for they are
the people for whom we need to provide assistance and with whom we
need to communicate.

At the same time, Barry McGaw, I thought, dismissed criterion-referenced
measurement rather too easily. I do not think that we can
dismiss criterion-referenced measurement at all. It may disappear in the
sense that measurement specialists lose interest in it and turn their
attention in other directions; but it will not disappear in the sense that
teachers are going to keep right on using it. Perhaps they may not call it
criterion-referenced measurement, but if, as a teacher, I have taught a
given area of content or towards a given set of objectives, the first thing I
will want to find out is whether my students can do certain things. No matter
whether the measurement experts are there to help me or not, that is what
I will be trying to do as a teacher, and that is what my testing will be
directed towards. So criterion-referenced measurement will not disappear,
even though we may not help the teachers as much as we should.

Secondly, there is an assumption involved in all latent trait models 
which is fine for measurement and for psychology, but which gets us into 
trouble when we try to use the models for evaluation. This is the assump- 
tion that Professor Thorndike spelt out, to the effect that we are never in- 
terested in the behaviours themselves, but only in the underlying trait 
which those behaviours represent. That very same point of view is
expressed at more length in a piece from a test manual published by
Educational Testing Service. This is not latent trait modelling; this is
traditional classical measurement, as practised in the bastion of
norm-referenced measurement, circa 1969:

When we use a test we are measuring indirectly by taking a series of 'readings'
(one for each test question), not of the characteristic that we are trying to
measure, but rather various indicators of that characteristic. Then we must try
to infer something about the characteristic itself from the indicators we have
collected. In a way a test is like radar, where observations of a series of 'blips'
on a screen are used to infer various characteristics of some unseen object.
(SCAT-STEP Series II Teacher's Handbook, 1969)

I want to say that when I am evaluating programs, or when I am
evaluating my own teaching, I am very interested in those blips. I want to 
know whether my students can do these particular tasks and those par- 
ticular tasks, and if they cannot, what can be done about it. So, from my 
point of view, the blips on the screen are very important. If the blips on 
the screen enable me to develop scores which measure an underlying trait 
that I have managed to increase, or at least stop from decreasing, then 
that is an added bonus. But I remain very interested in the blips on the 
screen, and I am not at all ready to dismiss them. It seems to me that, in 
evaluation, very often teachers want to know in what areas they have 
succeeded and in what areas they have failed, and the notion of a single 
latent trait may perhaps not have so much appeal. Certainly I would say 
that the measurement of a single latent trait has much more appeal to 
researchers than to evaluators.

A third feature of evaluation is that very often the evaluator is
interested in the performance of a group rather than in isolating the
performance of a single individual. That, I think, makes evaluation quite
often a distinct activity from research which is more inclined to focus on
individuals and how an individual tackles a problem. This is the focus of
Professor Spada's paper, and this is why I think it appropriate that the
title refers to 'research' rather than 'evaluation'. In considering his paper,
therefore, I am viewing it as a contribution to research (and I think it is
an important one) rather than to evaluation.

Firstly, it does seem to me that the notion (and this is, perhaps, an
over-simplification of what has been done) of analysing item difficulties
as a way of understanding the processes involved in solving problems and
in developing cognitive skills is a very important way of tackling those 
kinds of problems. What we have seen is a demonstration that item 
difficulties measured in the metric that the Rasch model provides can be 
analysed successfully in this way, and that this can lead to useful infor- 
mation and even understanding. What I would like people to think about 
exploring is whether item difficulties measured in traditional metric, or in 
other metrics that may be devised, can be treated in similar ways. It 
seems to me that we have a situation where we are interested in knowing 
what factors might make an item more difficult or less difficult, and the 
metric in which we measure item difficulty is something that people can 
legitimately differ about. I do not know which is the best metric in which
to measure item difficulty for these purposes, although some of the pro- 
perties of the Rasch model may make it particularly advantageous. 

In opening the paper, Professor Spada made a comment about the
measurement of change. He said psychometric analyses of such data
based on the Rasch model avoid the many pitfalls of applying classical
test theory. I hope they do, but I am not yet convinced. Every time I
listen to its advocates talking about the Rasch model, I have to keep
reminding myself that one of the nice properties of the model is that the
total score is a sufficient statistic for estimating ability. Another way of
putting that is that the ability estimates that you finish up with are really
a transformation of the total score (of the number of items answered
correctly) and therefore the error of measurement associated with those
will be carried along intact through the transformation. Errors of
measurement do not go away when you use a latent trait model. Some of
the major problems of measurement of change come about partly
because errors get confounded with one another, and partly because the
errors loom very large in comparison with the amount of change which
has taken place. Applying transformations of one kind or another does
not remove that problem; it is still going to be there, and I do not know
any way of overcoming it. There are also problems of metric when we
measure change. One thing that we cannot often do is equate a change of
so many points at one part of a scale to a change of so many points at
another part of the scale. Given that the use of the Rasch model
corresponds to a transformation of metric from one to another, it may well
be that those problems are at least reduced, if not eliminated, but I have
never seen it argued that this is so.
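
The sufficiency property referred to above can be illustrated with a minimal sketch, assuming the dichotomous Rasch model with known item difficulties. The function names, the bisection solver, and the five difficulty values are hypothetical and serve only as an illustration; they are not taken from the paper under discussion. Because the likelihood equation involves the responses only through their sum, two different response patterns with the same total score receive the same ability estimate.

```python
import math

def rasch_prob(theta, b):
    """Rasch probability of a correct response for ability theta, difficulty b."""
    return 1.0 / (1.0 + math.exp(-(theta - b)))

def ml_ability(responses, difficulties, lo=-10.0, hi=10.0, tol=1e-10):
    """Maximum-likelihood ability estimate for one response pattern.

    The likelihood equation is sum_i P_i(theta) = total score, so only the
    total score enters, not the pattern. Zero and perfect scores have no
    finite estimate and are rejected. Solved by bisection on [lo, hi].
    """
    total = sum(responses)
    if total == 0 or total == len(responses):
        raise ValueError("no finite ML estimate for zero or perfect scores")

    def excess(theta):
        # Expected score at theta minus the observed total score.
        return sum(rasch_prob(theta, b) for b in difficulties) - total

    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if excess(mid) > 0:   # expected score too high: ability is lower
            hi = mid
        else:
            lo = mid
    return (lo + hi) / 2.0

# Hypothetical item difficulties, for illustration only.
difficulties = [-1.5, -0.5, 0.0, 0.5, 1.5]
first = ml_ability([1, 1, 1, 0, 0], difficulties)
second = ml_ability([0, 0, 1, 1, 1], difficulties)
print(first == second)  # same total score, same estimate
```

The sketch shows only that the estimate is a function of the total score; it says nothing about the size of the measurement error that is carried through the transformation.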

I suppose the other question that I want to raise is whether the sort of
research that has been described could be tackled in other ways, and
whether other ways are better: whether, for instance, the same questions
could be addressed via variance component analysis on item difficulty
indices, be they Rasch model or classical or whatever. Are item difficulties
affected by this or that instructional treatment, by this or that
characteristic of the item? There are other ways of asking those questions
which may lead to better or worse answers. By what criteria do we judge
them to be better or worse answers? What I have tried to do here is to
raise a number of questions which may be taken up, if of interest, or
followed up in quite different directions if that seems more appropriate.

Using Latent Trait Measurement
Models to Analyse Attitudinal Data:
A Synthesis of Viewpoints

David Andrich

INTRODUCTION 
I have chosen to demonstrate how a Rasch latent trait model synthesizes 
two common approaches to attitude measurement. There are two
reasons for this choice. Firstly, because the two approaches to be con- 
sidered appeared in the literature around 1930, the time the Australian 
Council for Educational Research was founded, a presentation with 
some historical flavour seemed appropriate. Secondly, Dr Keeves's state- 
ment on objectives for this conference commenced as follows: 

During the past two decades there has been much effort expended by 
psychometricians in the development and perfection of latent trait 
measurement models. Yet it is only within the last five years or so that 
measurement procedures based upon these models have begun to make 
an appreciable impact on the practice of educational and psychological 
measurement in Australia. A few practitioners in these disciplines have 
become acquainted with these procedures, but most still remain un- 
acquainted with the features of the various latent trait models. Conse- 
quently, for the most part, traditional measurement procedures, 
developed during the first half of this century, are still being used.

Therefore it seemed that explicit connections to familiar traditions would
make the material less esoteric. The price for this apparent advantage is 
that sometimes characteristics of the more familiar approaches have to 
be rearranged and viewed from a somewhat different perspective. 

After a brief review of the two traditional approaches, the main model
of the paper, the Rasch model for ordered response categories which is
called the rating response model, is presented. Because it has been
presented elsewhere in more detail, this exposition is relatively brief. The
next section shows that the main features of the two traditional
approaches, both theoretical and practical, are also covered by the Rasch
rating model.

The development of this model, particularly with its emphasis on the 
explicit elimination of parameters in what is called the Rasch tradition of 
model construction, is traced. It is argued here that, without this perspec- 
tive of parameter elimination, the model is most unlikely to have been 
constructed. This section also attempts to show that the rating model can 
retain characteristics which usually are seen as mutually exclusive to the 
two approaches because it is set in a framework apart from the other 
two. To help make this point, some further less obvious but no less im- 
portant connections with the established approaches are described. 
While it has not been developed explicitly in that way, readers may see 
illustrative glimpses of a paradigm shift, in the sense of Kuhn (1970), in 
such a presentation of the Rasch tradition. This is not coincidental. I did
have an eye to Kuhn's thesis when structuring this paper. A brief sum- 
mary is then provided. 

THE THURSTONE AND LIKERT TRADITIONS 
FOR STUDYING ATTITUDES 
The following discussion on the relatively well-established frameworks 
for studying attitude is circumscribed in two ways. Firstly, it deals only 
with the two most common traditions, one associated with Thurstone 
which appeared formally in the late 1920s and the second associated 
with Likert which appeared in the early 1930s. Other approaches to data 
collection and its modelling, together with definitions of the concept of 
attitude, are covered in books such as Dawes (1972) and others. 
Secondly, only certain key characteristics of these traditions, which are 
well known but which set the relationships among traditions in context, 
are highlighted. These restrictions in scope are directed by the concern
with basic principles in both traditions rather than with their detailed 
elaborations which can be found, for either or both, in text books such as 
Edwards (1957), Torgerson (1958), Oppenheim (1966), and Bock and 
Jones (1968). 

The work of Guttman (1954) is not considered here, partly because of 
space and partly because it is not used as commonly as other traditions.
However, it is conjectured that, with further developments, the main 
features of Guttman's formulation could be covered by the rating model. 

The Thurstone Tradition

Rational Scales 

By analogy with studies in psychophysics, but with no reference to a 
physical continuum, Thurstone (1927a, 1927b) defined the concept of a 
discriminal process when a person reacts to a statement and formulated
the law of comparative judgment. The application of this law associates 
with each statement a real number, called the affective value, which in- 
dicates the relative degree of a particular affect the statement arouses. 
Although no physical continuum was available, the notion of a proper 
linear scale with well-defined intervals was not abandoned; indeed it was 
stressed (Thurstone, 1928). Consequently emphasis was placed on both 
the evidence that the statements could be placed on a single continuum 
and on estimating the relative affective values of the statements. A collec- 
tion of statements conforming to a linear continuum was taken to define 
a 'rational scale'. This term will be used throughout for a linear scale with
interval or additive properties. 

The Pair Comparison Design 

The law of comparative judgment is based on the design of 'pair com- 
parisons', in which persons compare statements with respect to their in- 
tensity relative to some particular attitude variable. (Thurstone 
developed applications of this design primarily in terms of social values 
and predictions of choice, but it can equally well be applied to attitude 
statements. Thurstone (1928) describes an alternate method for scaling 
statements for purposes of attitude measurement of individuals. This is 
discussed later in this section.) This law is generally formulated as 
follows (Thurstone, 1927a; Bock and Jones, 1968): 

1 On encountering statement $i$, a randomly selected person from a
population perceives it to have a real value $d_i$ on the affective scale
which, over a population of persons, may be defined by

$$d_i = \delta_i + \varepsilon_i \qquad (1)$$

where $\delta_i$ is the hypothesized scale value of the statement and
$\varepsilon_i$ is an error component associated with the person. In the
population of persons $d_i$ is a continuous random variable which is
normally distributed with expected value $\delta_i$ and variance
$\sigma_i^2$.

2 In comparing statement $i$ with statement $j$, the person reports
statement $i$ to have the greater value if $d_i - d_j > 0$. In the
population, this difference

$$d_{ij} = d_i - d_j = (\delta_i - \delta_j) + (\varepsilon_i - \varepsilon_j) \qquad (2)$$

is a continuous random variable, normally distributed, with expected
value

$$E[d_{ij}] = \delta_i - \delta_j \qquad (3)$$

and variance

$$V[d_{ij}] = \sigma_{ij}^2 = \sigma_i^2 + \sigma_j^2 - 2\rho_{ij}\sigma_i\sigma_j \qquad (4)$$
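
The expected value and variance of the difference process can be computed directly from the scale values, the two standard deviations, and the correlation between the two discriminal processes. The following is a minimal sketch under those definitions; the function name and the numerical values are hypothetical and chosen only for illustration.

```python
def difference_moments(delta_i, delta_j, sigma_i, sigma_j, rho_ij):
    """Expected value and variance of d_ij = d_i - d_j.

    The mean is the difference of the two scale values; the variance is the
    usual variance of a difference of two correlated normal variables.
    """
    mean = delta_i - delta_j
    variance = sigma_i ** 2 + sigma_j ** 2 - 2.0 * rho_ij * sigma_i * sigma_j
    return mean, variance

# Hypothetical values: unit standard deviations and a correlation of 0.5.
mean, variance = difference_moments(1.0, 0.5, 1.0, 1.0, 0.5)
print(mean, variance)
```

With equal standard deviations and perfectly correlated processes the variance of the difference vanishes, which is one way to see why some assumption about the correlation is needed before scale values can be estimated.
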

Figure 1a Probability that $d_{ij} > 0$ for Fixed $(\delta_i - \delta_j)$ in a Pair
Comparison Design

This difference process for a fixed $\delta_i - \delta_j$ is graphed in Figure 1a in which
the shaded region represents the probability that $d_{ij} > 0$. In data, the
proportion of persons who judge that statement $i$ has a greater effect than
statement $j$ is an estimate of this probability and the estimate of
$\delta_i - \delta_j$ is the corresponding normal deviate. The probability that
$d_{ij} > 0$, as a function of $(\delta_i - \delta_j)$, may be expressed as

$$p\{d_{ij} > 0 \mid \delta_i, \delta_j, \alpha_{ij}\} = \Phi[\alpha_{ij}(\delta_i - \delta_j)] \qquad (5)$$

where $\Phi$ is the cumulative normal distribution with mean zero and
variance unity, and where $\alpha_{ij} = 1/\sigma_{ij}$ is the discrimination. This
probability is graphed in Figure 1b.
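
A minimal sketch of equation (5) and of the estimation step just described, using the standard normal distribution from the Python standard library. The function names are hypothetical, and the default discrimination of 1 is an assumption made only so that the normal deviate can be read directly as a scale difference.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()  # mean zero, variance unity

def prob_i_over_j(delta_i, delta_j, alpha_ij=1.0):
    """Probability that statement i is judged greater than statement j."""
    return STD_NORMAL.cdf(alpha_ij * (delta_i - delta_j))

def estimate_difference(proportion, alpha_ij=1.0):
    """Scale difference implied by an observed proportion (0 < p < 1).

    The normal deviate of the proportion is divided by the assumed
    discrimination to return to the scale of the statements.
    """
    return STD_NORMAL.inv_cdf(proportion) / alpha_ij

p = prob_i_over_j(1.2, 0.4)
print(round(estimate_difference(p), 6))
```

Proportions of exactly 0 or 1 have no finite normal deviate, so in practice such pairs carry no usable information about the size of the difference.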

The consequence of the assumption that a person is randomly selected 
from a population and that the inter-individual differences form part of 
the error is that, when estimated from a body of data, the relative scale 
values of items actually describe the population and indicate nothing 
specific about any individual person beyond what can be inferred from 
the person's membership of that population. For example, Thurstone
(1927b) scaled social values with respect to criminal offences for a par- 
ticular population of persons. If the scale values in another population 
had been different, it would have been inferred that the populations were 
different with respect to their opinions regarding the offences. This issue
is amplified later. 

The pair comparison design, which has received a great deal of
attention in the literature both at a technical level (David, 1963; Bock and
Jones, 1968; Davidson and Farquhar†, 1976) and at a more theoretical
and philosophical level (Bradley, 1976), has the drawback that it is
extremely time-consuming. As a result, models for incomplete designs
(Bock and Jones, 1968, Ch. 7) have been described. Adaptations and
applications of the law of comparative judgment to the much simpler
design of rank ordering from which dependent pair comparisons can be
inferred were also developed by Thurstone (1931).
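
How pair comparisons follow from a rank ordering can be sketched briefly. The function below is a hypothetical illustration, not Thurstone's procedure: it simply lists, for one person's ranking, every pair in which the first statement was ranked above the second. The comparisons are dependent because a single ranking forces them to be mutually consistent.

```python
from itertools import combinations

def pairs_from_ranking(ranking):
    """Return the (preferred, other) pairs implied by one rank order.

    `ranking` lists statements from greatest to least judged value, so every
    earlier statement is taken to be preferred to every later one.
    """
    return list(combinations(ranking, 2))

print(pairs_from_ranking(["c", "a", "b"]))
```

A ranking of n statements yields n(n - 1)/2 implied comparisons from a single sorting task, which is the source of the saving in time.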

The Equal Appearing Interval Design 

With a scale having proper interval level properties, it was a small step to
realize that not only could populations be compared with respect to the
nature of the scale they generated but that, if they generated the same
scale, then the populations could also be compared for location and
dispersion. In this case, the scale takes on an additional characteristic,
that of a measuring instrument.

For the explicit purpose of constructing an attitude measuring
instrument in which many statements had to be scaled, the somewhat different
design of 'equal appearing intervals', which has particular significance
for this paper, was also developed by Thurstone (1928). In this design,
people order a collection of statements, of which some 20 are finally
required, into a number of groups which they consider appear equally
spaced on an affective continuum.

The model for the classification, displayed in Figure 2, is a
straightforward adaptation of the one shown in Figure 1b and again involves
assuming that a continuous random variable $d_i$ is induced when a person
encounters a statement. Then if $\tau_1, \tau_2, \ldots, \tau_k, \ldots, \tau_m$ designate the $m$
boundaries or thresholds separating the $m + 1$ ordered categories or
intervals, the response corresponds to the interval in which the value of the
random variable falls. If $\delta_i$ is again defined to be the affective value of
statement $i$, then the generalization of (5) is given by

$$p\{d_i > \delta_i - \tau_k \mid \delta_i, \alpha_i\} = \Phi[\alpha_i(\delta_i - \tau_k)], \qquad (6)$$

which is the probability that a randomly selected person will place the
statement in or below a particular category. The estimate of each of these
probabilities is the proportion of persons who classify the statement in or
below a given category and, by transforming these estimates to normal
deviates, the scale values of both the statements and the category
boundaries can be estimated (Edwards and Thurstone, 1952).
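
The transformation to normal deviates can be sketched as follows. This is a minimal illustration that evaluates only the cumulative normal expression on the right-hand side of (6) and then inverts it; the function names, the threshold values, and the unit discrimination are assumptions made for the example. With a discrimination of 1, each normal deviate recovers the difference between the statement's scale value and the corresponding threshold.

```python
from statistics import NormalDist

STD_NORMAL = NormalDist()

def threshold_probabilities(delta_i, taus, alpha_i=1.0):
    """Cumulative normal probability for each threshold tau_k."""
    return [STD_NORMAL.cdf(alpha_i * (delta_i - tau)) for tau in taus]

def normal_deviates(proportions):
    """Normal deviates of cumulative proportions strictly between 0 and 1."""
    return [STD_NORMAL.inv_cdf(p) for p in proportions]

# Hypothetical statement value and three hypothetical thresholds.
taus = [-1.0, 0.0, 1.0]
probs = threshold_probabilities(0.5, taus)
print([round(z, 6) for z in normal_deviates(probs)])
```

In data the proportions replace the model probabilities, and the deviates for several statements are combined to separate the statement values from the thresholds; that averaging step is not shown here.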

† This paper is a bibliography of some 350 papers on the topic of pair comparisons.

Figure 2 Probability that $d_i$ is Less than $\delta_i - \tau_k$ for each $k$ in the Equal
Appearing Interval Design or Probability that $d_{vi}$ is Less than
$\tau_k$ in Likert-style Statement

It becomes evident from (6) that, in the equal appearing interval 
design, the affective value of each statement is compared with the 
thresholds or category boundaries. This contrasts with the pair com- 
parison design where the affective values of two statements are compared 
with each other. Notice that no random variation is associated with the 
thresholds, but only with the statement. 

With a further rearrangement and redefinition so that a person $v$ has
his ability parameter $\beta_v$ compared with the item difficulty $\delta_i$ (Lumsden,
1977), equation (6) becomes

$$p\{d_{vi} > \beta_v - \delta_i \mid \beta_v, \delta_i, \alpha_i\} = \Phi[\alpha_i(\beta_v - \delta_i)] \qquad (7)$$

This model has been extensively studied for dichotomous responses to
achievement test items (Lord, 1952; Kolakowski and Bock, 1970).

There are three aspects of this extended Thurstone framework that are
especially relevant here. Firstly, Thurstone stressed the importance of the
invariance of the scale with respect to people to be measured and made a
clear distinction between the construction of a scale and its use for
measurement as follows:

It will be noticed that the construction and the application of a scale for
measuring attitude are two different tasks. If the scale is to be regarded
as valid, the scale values of the statements should not be affected by the
opinions of the people who help construct it. This may turn out to be a
severe test in practice, but the scaling method must stand such a test
before it can be accepted as being more than a description of the people
who construct the scale. (Thurstone†, 1928; 1959, p. 228)

† Many of Thurstone's papers are reproduced in Thurstone (1959).

Thurstone described how these assumptions can be tested empirically by
taking persons of known different attitude and comparing the relative
scale values of statements obtained from the two groups. Secondly, he
also recognized the complementary requirement that a person's measure
should be independent of specific statements in the set. With respect to
an achievement testing situation, he made this point as follows:

It should be possible to omit several test questions at different levels of
the scale without affecting the individual score. (Thurstone, 1926, p.
446)

Thirdly, while recognizing the importance of person-attitudes, both as
a possible source of contamination in scale construction, and for the
comparison of two groups with respect to location and dispersion,
Thurstone never explicitly formalized a person-effect. Consequently,
the procedure for 'measuring' persons with a scale, though practical and
sensible, was essentially ad hoc. The procedure is to ask the person either
to agree or disagree with each statement in the scale and then the
measurement is taken to be either the mean or the median of the scale
values of statements the person endorses. To establish invariance over
statements, the statements must be equally spaced on the continuum.
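
The measurement procedure just described amounts to a few lines of arithmetic. The sketch below assumes that the scale values of the statements are already known; the function name, the statement labels, and the values are hypothetical.

```python
from statistics import mean, median

def thurstone_measure(scale_values, endorsed, use_median=False):
    """Mean or median scale value of the statements a person endorses.

    `scale_values` maps each statement to its previously estimated scale
    value; `endorsed` lists the statements the person agrees with.
    """
    chosen = [scale_values[s] for s in endorsed]
    if not chosen:
        raise ValueError("the person endorsed no statements")
    return median(chosen) if use_median else mean(chosen)

# Hypothetical scale values for three statements.
scale = {"s1": 1.0, "s2": 2.0, "s3": 4.0}
print(thurstone_measure(scale, ["s1", "s2", "s3"], use_median=True))
```

Nothing in this calculation models the person's response process, which is the sense in which the procedure is ad hoc.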

The Likert Tradition

Data Collection and Scoring 

Perhaps the most popular design for studying attitude is that from Likert
(1932) in which persons respond directly to statements by indicating the
degree of intensity with which they approve or disapprove of them. It is
the same design as that of Thurstone for attitude measurement but,
instead of simple endorsement or rejection, the responses have degrees of
endorsement or rejection. To distinguish it from the pair-comparison
and other data collection designs, this will be called the 'direct-response'
design.

The basic response set of Likert is {Strongly Approve, Approve,
Undecided, Disapprove, Strongly Disapprove}, though variations, most
of which also have five categories, are readily devised; for example,
another set is {Always, Often, Sometimes, Seldom, Never}. The
statements are similar to those constructed by Thurstone, but they are not
first scaled. Instead the ordered categories are simply scored with
successive integers and a person's attitude value is taken as the sum of the
scores of all statements.
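
Integer scoring of this kind can be sketched in a few lines. The direction of scoring (1 for Strongly Disapprove up to 5 for Strongly Approve) and the function name are assumptions made for the illustration; the text specifies only that successive integers are used and summed.

```python
# Assumed scoring direction: higher integers for stronger approval.
CATEGORIES = ("Strongly Disapprove", "Disapprove", "Undecided",
              "Approve", "Strongly Approve")
SCORES = {label: k for k, label in enumerate(CATEGORIES, start=1)}

def likert_total(responses):
    """Sum the integer category scores over all statements for one person."""
    return sum(SCORES[r] for r in responses)

print(likert_total(["Approve", "Undecided", "Strongly Approve"]))  # 4 + 3 + 5 = 12
```

No scale values for the statements enter the calculation, which is the point of contrast with the Thurstone procedure.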

This approach is popular because it is simple, it focuses directly on the 
attitude of persons, empirical researchers find it satisfactory, and it is 
theoretically undemanding. In explaining his own design, Likert wrote 
the following with respect to those of Thurstone: 

A number of statistical assumptions are made in the application of his
attitude scales, e.g. that the scale values of statements are independent
of the attitude distribution of the readers who sort the statements,
assumptions which, as Thurstone points out, have not been verified.
The method is, moreover, exceedingly laborious. It seems legitimate to
inquire whether it actually does its work better than simpler scales
which may be employed, and in the same breath to ask also whether it
is not possible to construct equally reliable scales without making
unnecessary statistical assumptions. (Likert, 1932, p. 7)

Thus, while arriving directly at person-attitude as if obtained by a
measuring instrument, Likert rejected the necessity, spelt out by
Thurstone, that the instrument's operating properties be invariant across
different groups which are to be measured. He rejected, also, recourse to
any formal statistical modelling of a response process.

Likert did originally investigate derivation of empirical weights for the
categories, not statements, and in doing so, assumed that the distribution
of responses across categories was normal. As scale values for the
categories, he also used the normal deviates corresponding to the
cumulative distribution. It is interesting to note that Likert observed that
the distribution of responses across categories was often skewed. He
took this to be, primarily, a function of the attitude distribution of the
group involved. For example, with respect to a statement which had the
distribution [1, 1, 3, 8, 87] in one group, he noted that it had the less
skewed distribution [4, 3, 17, 18, 58] in a group with attitudes known to
be different from those in the first group. In this context he wrote:

On the basis of this experimental evidence and upon the results of
others, ... it seems justifiable for experimental purposes to assume
that attitudes are distributed fairly normally and to use this assumption
as the basis for combining the different statements. The possible
dangers inherent in this assumption are fully realised. This assumption
is made simply as part of an experimental approach to attitude
measurement. It is a step which it is hoped subsequent work in this field
will either make unnecessary or prove justifiable. Perhaps this
assumption is not correct; its correctness can best be determined by further
experiment. (Likert, 1932, p. 22)

There is no attempt here to define the population in which the
distribution is normal, but it is interesting to note firstly, that
distributional assumptions among people were mentioned, and secondly,
that Likert hoped these would prove unnecessary. It is also interesting
that no mention was made of the effect of statement scale values on the
distribution of responses among categories.

Following such scaling of categories, Likert investigated the weighting
of categories by successive integers. A comparison of scores of persons
obtained by a sum across items of the empirically determined weights,
and those obtained by a simple sum of the integral weights, provided
almost perfect correlations. On such evidence, Likert concluded that it
was adequate to use the simpler weights.
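
The idea of scaling categories by normal deviates can be illustrated with a rough sketch. It is not Likert's exact computation: as a simplifying assumption, each category is given the normal deviate of the midpoint of its share of the cumulative distribution, and the function name is hypothetical. The percentages are those quoted above for one statement in two groups.

```python
from statistics import NormalDist

def category_deviates(percentages):
    """Normal deviate at the cumulative midpoint of each ordered category.

    Assumes every category has a non-zero share, so that each midpoint lies
    strictly between 0 and 1.
    """
    total = sum(percentages)
    weights, below = [], 0.0
    for pct in percentages:
        midpoint = (below + pct / 2.0) / total  # cumulative share at mid-category
        weights.append(NormalDist().inv_cdf(midpoint))
        below += pct
    return weights

for distribution in ([1, 1, 3, 8, 87], [4, 3, 17, 18, 58]):
    print([round(w, 2) for w in category_deviates(distribution)])
```

The resulting weights are ordered like the integers 1 to 5 but are not equally spaced, and their spacing changes with the group. Likert's comparison concerned the person scores obtained under the two weighting schemes, not the weights themselves.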

The procedures for checking the consistency of statements in evoking
unidimensional responses are also less formal than in the Thurstone
tradition and parallel the approach of traditional test theory (Gulliksen,
1950; Lord and Novick, 1968) to such issues. Thus correlations between
person-scores on odd and even statements of a questionnaire and
correlations between person-scores on a statement and on the total set of
statements are employed. In addition a discrimination-type index,
obtained by comparing the scores of a statement on two extreme groups
defined by their total scores, is often calculated. These indices are all
formally much less rigorous than those for the pair-comparison and
equal-appearing-interval designs which are exact probability statements
regarding the quality of fit.

Although the Likert format has proved extremely satisfactory, both in 
terms of easy application and traditional reliability criteria, two related 
issues continue to be questioned: firstly, the adequacy of integer scoring, 
and secondly, the correctness in considering the Undecided middle 
category to represent an attitude between Approve and Disapprove. The 
first pertains to the belief that scoring by successive integers depends 
upon equal distances between successive categories while the second 
pertains more to the question of unidimensionality. Thus it is considered 
that a person may respond to the Undecided category for reasons such as 
failure to understand the statement, indifference, or ignorance, as well as 
some kind of neutrality (Dubois and Burns, 1975). Therefore, because an 
expression of attitude is considered more informative, there is an 
intuitive appeal in constructing statements which do not attract responses 
in the middle category. However, that is the very reason for concern 
regarding the weighting of the middle category. 

Statistical Models 

In advances with latent trait theory, the threshold concept with respect to 
ordered categories within statements, as investigated by Likert, has been 
formalized following the cumulative normal ogive procedure of 
Thurstone outlined above. Samejima (1969) developed the mathematical 
machinery for the case of statements and persons while Kolakowski and 
Bock (1972) have written a computer program to execute data analyses. 
However, these authors and others seem to have concentrated on 
achievement items with more than one category rather than attitude 
items. Two further points in relation to this development need to be 
made. 

Firstly, in achievement testing the threshold parameter is taken to be 
different on each item so that instead of the additive structure 
$\delta_i - \tau_k$ between item and threshold parameters, there is a 
different threshold parameter for each item. In a Likert-style attitude 
questionnaire, this would correspond to Likert's original approach to 
scaling categories rather than statements. Secondly, these authors and 
others shift from the cumulative normal to the logistic distribution. This 
is done because in maximum likelihood parameter estimation, which tends 
to be used, the explicit logistic is far more tractable than the implicit 
normal. For the pair comparison design, the logistic analogue to (5) takes 
the form 

\[ p\{d_{ij} > 0 \mid \delta_i, \delta_j, \alpha^*\} = \frac{\exp\{\alpha^*(\delta_i - \delta_j)\}}{1 + \exp\{\alpha^*(\delta_i - \delta_j)\}} \tag{8} \]

while, for the ordered category situation, the analogue to (6) is 

\[ p\{d_{ik} > 0 \mid \delta_i, \tau_k, \alpha^*\} = \frac{\exp\{\alpha^*(\delta_i - \tau_k)\}}{1 + \exp\{\alpha^*(\delta_i - \tau_k)\}} \tag{9} \]

For a separate scaling of categories for each statement, $\delta_i - \tau_k$ 
in (9) is simply replaced by, say, $\delta_{ik}$ so that different 
thresholds always pertain to different statements. 

It is well known that, with the constant factor adjustment of 1.7 to the 
unit, the numerical values of the cumulative logistic and normal 
differ by less than 0.01 over the entire domain of the variable (Johnson 
and Kotz, 1972) and this numerical equivalence to the normal has given 
further justification for use of the logistic. Thus use of the logistic is 
made firstly for algebraic convenience and secondly because any 
differences in statistical results, apart from the unit which is often 
automatically adjusted, are negligible. Bradley and Terry (1952) and 
Luce (1959) considered the logistic model for pair comparison data in 
which $\alpha^*$ effectively is unity, while Birnbaum (1968) and Jensema 
(1974) have considered it for achievement items by analogy to Lord's 
(1952) consideration of the normal ogive model. 
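The closeness of the two curves can be checked numerically. The following is a minimal sketch, not part of the original paper: the function names and the grid over which the curves are compared are illustrative assumptions, and the factor 1.7 is the conventional adjustment mentioned above.

```python
import math


def normal_cdf(x: float) -> float:
    """Cumulative standard normal, written in terms of the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))


def logistic_cdf(x: float, scale: float = 1.7) -> float:
    """Cumulative logistic with its unit adjusted by the constant `scale`."""
    return 1.0 / (1.0 + math.exp(-scale * x))


def max_abs_difference(scale: float = 1.7, lo: float = -8.0,
                       hi: float = 8.0, steps: int = 16000) -> float:
    """Largest absolute gap between the two curves on an evenly spaced grid."""
    width = (hi - lo) / steps
    return max(
        abs(normal_cdf(lo + i * width) - logistic_cdf(lo + i * width, scale))
        for i in range(steps + 1)
    )


if __name__ == "__main__":
    print(f"max |normal - logistic(1.7x)| = {max_abs_difference():.4f}")
    print(f"max |normal - logistic(x)|    = {max_abs_difference(1.0):.4f}")
```

Without the adjustment of the unit (scale 1.0) the gap is much larger, which is why the adjustment is usually applied automatically.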



THE RASCH RATING MODEL FOR ORDERED 
RESPONSE CATEGORIES 
The Rasch models are formalized immediately for the direct response 
design in which a response is assumed governed by both the affective 
value of a statement and the attitude value of the person. Rasch (1961) 
immediately specifies the condition, required by Thurstone, that the 
relative scale values of statements be independent of the scale values of 
persons. (Rasch reached the significance of this requirement, and the 
possibility of its realization, quite independently of Thurstone's writings. 
The main steps by which he did arrive at these issues are documented in 
Rasch (1977). The connections to Thurstone are made here for purposes 
of exposition of the relationships among approaches.) Rasch also makes 
explicit the symmetrical condition that the relative scale values of persons 
should be independent of scale values of the statements. Complementary 
conditions, which are also made explicit, are that the relative scale values 
of a pair of statements should be independent of the scale values of any 
other statements, and that the relative scale values of any pair of persons 
should be independent of the scale values of any other persons. Such re- 
quirements may be satisfied within some explicit frame of reference 
which includes defining the class or set of persons, the class or set of 
statements, and any other relevant conditions. 

Specifically Objective Comparisons 

In both the Thurstone and Rasch specifications, the stress is on relative 
rather than absolute scale values. In the Rasch system, this characteristic 
is often enunciated in terms of comparisons. 

Rasch defines comparisons between the scale values of any two 
statements or any two persons, which depend only on the values of the 
two statements or the two persons being compared, to be specifically 
objective. The 'objective' term arises from the feature of independence of 
all other values in the system except the two being compared, while the 
'specific' term is used to indicate that this objectivity is relative to some 
specified frame of reference (Rasch, 1977). 

These specifications immediately give the scale values of statements 
and persons an explicit generality. That is, one can say, for example, that 
with respect to a class of statements and persons, and without further 
qualification, statement A has a greater affective value than statement B. 
Analogously, one can say that within the same frame of reference, and 
without further qualification, person C has a greater attitude than person 
D. Evidence for the difference in affective values between two statements 
is the same irrespective of person attitude; therefore the evidence from 
every person must point to the same difference. Again analogously, 
evidence for the difference in attitudes between two persons is the same 
irrespective of statement affective value; therefore the evidence from 
every statement must point to the same difference. 

The requirement of objective comparisons can be used explicitly to 
check which statements and persons are so related and then to define 
classes of statements and persons which may be termed 'mutually 
conformable'. Further implications of this specification are considered in 
the next section. 

The Model for Dichotomous Responses 

In the context of statement and person scaling, the ordering of 
statements and persons on a linear continuum, as articulated by 
Thurstone, is immediately assumed. Accordingly, let $\beta_v$ and 
$\delta_i$ be real numbers which characterize the scale values of person 
$v$ and statement $i$ respectively. Then the response of person $v$ to 
statement $i$ is governed by some function 
$\theta_{vi} = \theta(\beta_v, \delta_i)$. To begin with, consider only a 
dichotomous response of endorsement or rejection rather than the Likert 
response which has degrees of endorsement or rejection. The response, 
while governed by $\theta_{vi}$, is not completely determined by it; 
therefore a random variable $Y_{vi}$ which takes on the value 
$y_{vi} = 0$ for one response (rejection say) and $y_{vi} = 1$ for the 
other (endorsement) may be defined. Associated with the respective values 
are the probabilities 

\[ p\{y_{vi} = 0 \mid \beta_v, \delta_i\} = f_0(\theta_{vi}) \]
and
\[ p\{y_{vi} = 1 \mid \beta_v, \delta_i\} = f_1(\theta_{vi}) = 1 - f_0(\theta_{vi}). \tag{10} \]

Both the function $\theta$ determining the structural relationship of 
$\beta_v$ and $\delta_i$, and the function $f$ giving the probability 
distribution, must be specified. The necessary and sufficient condition 
required to satisfy specific objectivity (Rasch, 1968) is that 
$\theta_{vi} = \exp(\beta_v - \delta_i)$, or its equivalent, and that 
$f_1(\theta_{vi}) = \theta_{vi}/(1 + \theta_{vi})$. Entering these 
functions into (10) gives 

\[ p\{y_{vi} = 0 \mid \beta_v, \delta_i\} = 1/\psi_{vi} \]
and
\[ p\{y_{vi} = 1 \mid \beta_v, \delta_i\} = \exp(\beta_v - \delta_i)/\psi_{vi}, \tag{11} \]

where $\psi_{vi} = 1 + \exp(\beta_v - \delta_i)$, 

in which the logistic form of the model becomes evident. This is known 
as Rasch's simple logistic model (SLM) for dichotomous responses, and 
the consequences and illustrations of why this model permits the 
elimination of the person parameters while estimating the contrasts among 
statement parameters are well documented (e.g. Rasch, 1960, 1961, 1968; 
Andersen, 1973a, 1973b; Wright, 1968, 1977; Fischer, 1973, 1976). In his 
paper, Douglas discusses these statistical issues in some detail. The key 
feature of the model is that the estimation of statement parameters 
involves firstly identifying a set of sufficient statistics for the person 
parameters and secondly conditioning on these statistics so that the 
person parameters are eliminated from the resulting probability 
expressions. 

For completeness and clarity, a small example may be useful. If two 
statements 1, 2 whose parameters $\delta_1$ and $\delta_2$ are to be 
compared are responded to by a person $v$ with parameter $\beta_v$, the 
response set is the ordered pair $(y_{v1}, y_{v2})$ with the sample space 
$\Omega = \{(0,0), (1,0), (0,1), (1,1)\}$. With each element of the 
sample space the total score 

\[ r_v = y_{v1} + y_{v2} \]

is an observable statistic. The pattern of responses and this statistic are: 

    Pattern    r_v
    (0, 0)     0
    (1, 0)     1
    (0, 1)     1
    (1, 1)     2

The statistic $r_v$ can be seen to partition the sample space into the 
sub-spaces $\Omega_1 = \{(0,0)\}$, $\Omega_2 = \{(0,1), (1,0)\}$ and 
$\Omega_3 = \{(1,1)\}$. Now if within a sub-space so defined (that is, 
conditional on the sub-space or equivalently conditional on that value of 
the statistic) the distribution of elements is independent of a particular 
parameter of the model, then that statistic is said to be sufficient for 
that parameter.† It is sufficient in that no further information regarding 
the value of the parameter can be obtained by taking account of the 
pattern of responses. In the case of the simple logistic model for two 
statements, it can be shown that 

\[ p\{(y_{v1}, y_{v2}) \mid r_v = 1, \delta_1, \delta_2\} = \frac{\exp(-y_{v1}\delta_1 - y_{v2}\delta_2)}{\exp(-\delta_1) + \exp(-\delta_2)} \tag{12} \]

which is independent of the person parameter $\beta_v$. Therefore 
irrespective of the attitude values of the persons, which can be expected 
to be different, each response within the sub-space is a replicate of each 
other with respect to the same parameters, $\delta_1$ and $\delta_2$, 
which are to be compared. When sufficient statistics for parameters can be 
obtained so that one set is eliminated while another set is estimated, the 
parameters are often said to be separable. 
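The elimination of the person parameter in (12) can be verified by direct enumeration. The sketch below is illustrative only: the function names are hypothetical, and the parameter values used in the usage lines are arbitrary. It computes the conditional probability of a response pattern given its total score from the joint probabilities in (11), and compares it with the closed form (12).

```python
import math
from itertools import product


def slm_prob(y, beta, delta):
    """P{Y = y | beta, delta} under the simple logistic model (11)."""
    psi = 1.0 + math.exp(beta - delta)
    return math.exp(y * (beta - delta)) / psi


def conditional_given_total(pattern, beta, deltas):
    """P{pattern | total score}, obtained by enumerating the sample space."""
    def joint(p):
        return math.prod(slm_prob(y, beta, d) for y, d in zip(p, deltas))

    total = sum(pattern)
    # Sub-space of all patterns that share the observed total score.
    sub_space = [p for p in product((0, 1), repeat=len(deltas))
                 if sum(p) == total]
    return joint(pattern) / sum(joint(p) for p in sub_space)


def formula_12(pattern, deltas):
    """Closed form (12) for two statements and a total score of 1."""
    (y1, y2), (d1, d2) = pattern, deltas
    return math.exp(-y1 * d1 - y2 * d2) / (math.exp(-d1) + math.exp(-d2))


if __name__ == "__main__":
    deltas = (-0.5, 1.2)          # arbitrary statement values
    for beta in (-2.0, 0.0, 3.0):  # very different person values
        print(beta, conditional_given_total((1, 0), beta, deltas),
              formula_12((1, 0), deltas))
```

Whatever value of the person parameter is supplied, the enumerated conditional probability agrees with (12), which is the sense in which the total score is sufficient.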

The Model for Ordered Polychotomous Responses 

The generalization of this model for more than two ordered categories, 
which is of interest in this paper, has been described in Andrich (1978a, 
1979), Wright and Masters (1980) and Masters (1980), where the latter 
authors give a slightly different rationale from the one presented here. 
Therefore the exposition below will be relatively brief. 

† The principle of sufficiency was observed by R. A. Fisher in 1922 (Rasch, 1960). Rasch 
studied with Fisher in 1935 while these ideas and the associated theory of maximum 
likelihood were being developed. In his work, Rasch shifts the emphasis of the sufficient 
statistic from estimation to the elimination of parameters. Andersen (1973b) developed the 
theory of conditional inference. 

First, suppose that every statement again has an affective value 
$\delta_i$ and that the categories qualify this affective value. 
Specifically, let $\tau_1, \tau_2, \ldots, \tau_m$ be real values 
designating thresholds or boundary points between categories, where these 
threshold values are on the same scale as the statement affective values, 
and $\tau_k > \tau_{k-1}$ for $k = 2, \ldots, m$. The thresholds are 
taken to qualify the statement and therefore effectively increase or 
decrease its affective value. Accordingly, suppose the thresholds and 
statement values are related additively and enter the model in the form 
$\delta_i + \tau_k$. 

Second, suppose a response process of the form (11) at each threshold. 
Then with the addition of the threshold parameter, the response at 
threshold $k$ is modelled by 

\[ p\{y_k = 1 \mid \beta_v, \delta_i, \tau_k\} = \exp\{\beta_v - (\delta_i + \tau_k)\}/\psi_{vik}, \tag{13} \]

where $\psi_{vik} = 1 + \exp\{\beta_v - (\delta_i + \tau_k)\}$. 

Third, consider for the moment, and for simplicity, the case of only 
two thresholds and three categories. If the responses at each threshold 
are assumed instantaneously to be statistically and experimentally 
independent, the set of possible outcomes or the sample space $\Omega$ for 
responses to the two thresholds is the set of ordered pairs 
$\Omega = \{(0,0), (1,0), (1,1), (0,1)\}$, where the first member of each 
ordered pair indicates the response at the first threshold. The 
probabilities of these outcomes are given by† 

\[ p\{(0,0) \mid \beta_v, \delta_i, \boldsymbol{\tau}\} = 1/\psi_{vi1}\psi_{vi2} \]
\[ p\{(1,0) \mid \beta_v, \delta_i, \boldsymbol{\tau}\} = \exp\{\beta_v - (\delta_i + \tau_1)\}/\psi_{vi1}\psi_{vi2} \]
\[ p\{(1,1) \mid \beta_v, \delta_i, \boldsymbol{\tau}\} = \exp\{\beta_v - (\delta_i + \tau_1) + \beta_v - (\delta_i + \tau_2)\}/\psi_{vi1}\psi_{vi2} \]
and
\[ p\{(0,1) \mid \beta_v, \delta_i, \boldsymbol{\tau}\} = \exp\{\beta_v - (\delta_i + \tau_2)\}/\psi_{vi1}\psi_{vi2}. \tag{14} \]

† A vector variable is set in bold type face, e.g. $\boldsymbol{\tau}$ for $(\tau_1, \tau_2, \ldots, \tau_m)$. 

After considering each threshold separately, the person must bring the 
two processes together and, in doing so, recognize the ordering of the 
thresholds and the categories. Consequently the person must recognize 
that the pair (0,1) reflects a response above the second threshold and 
below the first, that is, an incompatible pair of responses. Therefore 
suppose the response (0,1) is not recorded if it occurs instantaneously, and 
that it is reconsidered and eventually distributed in one of the compatible 
pairs of responses. (Note that if these responses at each threshold were 
spaced in time so that memory, say, played no part, such an outcome 
could occur. For example, in grading or rating a paper as excellent or not 
on one occasion, and fail or not fail on another with respect to a three 
level category set of {fail, pass, excellent}, it would be possible for the 
paper to be rated fail on one occasion and excellent on another.) 

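Fourth, suppose that each response in the sub-set 
$\Omega' = \{(0,0), (1,0), (1,1)\}$ of compatible responses retains the 
same relative probability as in the full space $\Omega$. The appropriate 
probabilities are obtained simply by normalizing the probabilities with 
respect to $\Omega'$. After some algebraic rearrangements, these are given 
by 

\[ p\{(0,0) \mid \beta_v, \delta_i, \boldsymbol{\tau}\} = 1/\gamma_{vi} \]
\[ p\{(1,0) \mid \beta_v, \delta_i, \boldsymbol{\tau}\} = \exp\{\beta_v - (\delta_i + \tau_1)\}/\gamma_{vi} \tag{15} \]
\[ p\{(1,1) \mid \beta_v, \delta_i, \boldsymbol{\tau}\} = \exp\{\beta_v - (\delta_i + \tau_1) + \beta_v - (\delta_i + \tau_2)\}/\gamma_{vi} \]

where 
\[ \gamma_{vi} = \gamma(\beta_v, \delta_i, \boldsymbol{\tau}) = 1 + \exp\{\beta_v - (\delta_i + \tau_1)\} + \exp\{\beta_v - (\delta_i + \tau_1) + \beta_v - (\delta_i + \tau_2)\}. \]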

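To simplify (15), first define the random variable $X_{vi}$ to take the 
integral values $x$ corresponding to the vector responses in $\Omega'$ 
according to 

$x = 0$ for (0,0), 
$x = 1$ for (1,0), 
and $x = 2$ for (1,1). 

Clearly, the value $x_{vi}$ indicates the number of thresholds exceeded, 
where $x_{vi} = 0$ indicates that none has been. Second, observe that 

\[ \beta_v - (\delta_i + \tau_1) = -\tau_1 + (\beta_v - \delta_i) \]

and that 

\[ \beta_v - (\delta_i + \tau_1) + \beta_v - (\delta_i + \tau_2) = -\tau_1 - \tau_2 + 2(\beta_v - \delta_i). \]

Third, define the combinations of the thresholds according to 
$\kappa_1 = -\tau_1$, $\kappa_2 = -\tau_1 - \tau_2$. 
Then (15) may be simplified to 

\[ p\{X_{vi} = 0 \mid \beta_v, \delta_i, \boldsymbol{\tau}\} = 1/\gamma_{vi} \]
\[ p\{X_{vi} = x \mid \beta_v, \delta_i, \boldsymbol{\tau}\} = \exp\{\kappa_x + x(\beta_v - \delta_i)\}/\gamma_{vi}, \quad x = 1, 2. \tag{16} \]

Finally, generalizing to $m$ thresholds and $m + 1$ categories, and 
defining $\kappa_0 = 0$ for $x = 0$, (16) becomes 

\[ p\{X_{vi} = x \mid \beta_v, \delta_i, \boldsymbol{\tau}\} = \exp\{\kappa_x + x(\beta_v - \delta_i)\}/\gamma_{vi} \tag{17} \]

where $\gamma_{vi} = \sum_{k=0}^{m} \exp\{\kappa_k + k(\beta_v - \delta_i)\}$. 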
Equation (17) is the rating response model to which the rest of the paper 
is primarily devoted. 

The most interesting and important feature of (17) is that the 
non-negative integral values of the random variable appear conveniently in 
the probability distribution and make it a member of the exponential 
family. Special cases of this distribution are the well-known binomial 
and Poisson distributions in which $\exp \kappa_x = \binom{m}{x}$ in the 
former and $\exp \kappa_x = 1/x!$ in the latter. The model of equation 
(11) for dichotomous responses is also clearly a special case. Masters 
(1980) gives a further account of the models of this family in relation to 
Rasch models, as does Douglas at this conference. 
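The two special cases can be illustrated with a short numerical check. This is a sketch under stated assumptions: the function names are hypothetical, the parameter values are arbitrary, and the Poisson case is approximated by truncating the (in principle unbounded) set of categories at a large number.

```python
import math


def rating_probs(kappas, beta, delta):
    """Rating model (17) for given category coefficients kappa_0..kappa_m."""
    terms = [math.exp(kappa + x * (beta - delta))
             for x, kappa in enumerate(kappas)]
    gamma = sum(terms)
    return [t / gamma for t in terms]


def binomial_pmf(m, p):
    return [math.comb(m, x) * p ** x * (1.0 - p) ** (m - x)
            for x in range(m + 1)]


def poisson_pmf(rate, n_terms):
    return [math.exp(-rate + x * math.log(rate) - math.lgamma(x + 1))
            for x in range(n_terms)]


if __name__ == "__main__":
    beta, delta, m = 0.3, 0.0, 4

    # Binomial case: exp(kappa_x) is the binomial coefficient C(m, x).
    kappas = [math.log(math.comb(m, x)) for x in range(m + 1)]
    p = 1.0 / (1.0 + math.exp(-(beta - delta)))
    print(rating_probs(kappas, beta, delta))
    print(binomial_pmf(m, p))

    # Poisson case: exp(kappa_x) = 1/x!, truncated here at 60 categories.
    kappas = [-math.lgamma(x + 1) for x in range(61)]
    print(rating_probs(kappas, beta, delta)[:5])
    print(poisson_pmf(math.exp(beta - delta), 5))
```

In the binomial case the success probability is the dichotomous SLM probability, and in the Poisson case the rate is $\exp(\beta_v - \delta_i)$.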

Next, and as a consequence, the sufficient statistic for the person 
parameter $\beta_v$ becomes simply $r_v = \sum_i x_{vi}$, which is 
identical to the result in the dichotomous case. Analogously, the 
sufficient statistic for the statement parameter $\delta_i$ is 
$s_i = \sum_v x_{vi}$ and the sufficient statistic for the category 
coefficient parameter $\kappa_x$ is $T_x = \sum_v \sum_i I_{xvi}$, where 
$I$ is an indicator variable which takes the value 1 if a response is in 
category $x$, and 0 otherwise. That is, $T_x$ is the total number of 
responses, over all persons and all items, in category $x$. The various 
arguments which actually eliminate the parameters through a conditioning 
on these statistics will not be pursued here. (A distinction between a set 
of jointly sufficient statistics for a set of parameters and individually 
sufficient statistics is not made here though, in developing statistical 
machinery, the distinction can be important (Barnard, 1974).) 

In addition, the category characteristic curves have intuitively de- 
fensible and appealing properties. These can best be observed from the 
example shown in Figure 3 in which curves are drawn for the case of four 
thresholds. Firstly, the thresholds are equally spaced about an origin of 
zero. While this is not necessary, the thresholds can always be centred 
about zero (Andrich, 1978a) without any loss of generality. This makes it 
convenient for the interpretation of thresholds as qualifying affective 
values of statements, some decreasing and some increasing their affective 
values. Secondly, as $\beta_v$ gets larger than $\delta_i$, the probability 
of a high score increases. Thirdly, the response with the highest 
probability corresponds to the interval in which $\beta_v$ falls. Finally, 
although the assumption of independence of decisions at thresholds as a 
first step in deriving the model might seem 'counter-intuitive' initially 
(McCullagh, 1980), it is clear from equation (17) that in the final 
response process, the response in any category is dependent on all 
thresholds. 
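The second and third properties can be checked for a set of equally spaced thresholds centred on zero. The sketch below is illustrative: the function names and the threshold values (-1.5, -0.5, 0.5, 1.5) are assumptions chosen to resemble the four-threshold case, not values read from Figure 3. Because the ratio of adjacent category probabilities in (17) is $\exp\{\beta_v - (\delta_i + \tau_x)\}$, ordered thresholds imply that the modal category equals the number of thresholds lying below $\beta_v - \delta_i$.

```python
import math


def rating_probs(beta, delta, taus):
    """Category probabilities under (17) for thresholds tau_1..tau_m."""
    kappas = [0.0]
    for tau in taus:
        kappas.append(kappas[-1] - tau)
    terms = [math.exp(kappa + x * (beta - delta))
             for x, kappa in enumerate(kappas)]
    gamma = sum(terms)
    return [t / gamma for t in terms]


def modal_category(beta, delta, taus):
    """Category with the highest probability."""
    probs = rating_probs(beta, delta, taus)
    return max(range(len(probs)), key=lambda x: probs[x])


def thresholds_exceeded(beta, delta, taus):
    """Number of thresholds lying below beta - delta."""
    return sum(1 for tau in taus if beta - delta > tau)


def expected_score(beta, delta, taus):
    return sum(x * p for x, p in enumerate(rating_probs(beta, delta, taus)))


if __name__ == "__main__":
    taus = [-1.5, -0.5, 0.5, 1.5]  # assumed, equally spaced about zero
    for beta in (-3.0, -1.0, 0.0, 1.0, 3.0):
        print(beta, modal_category(beta, 0.0, taus),
              thresholds_exceeded(beta, 0.0, taus),
              round(expected_score(beta, 0.0, taus), 3))
```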

An application of this model, for the case $m > 1$, which is identical in 
principle to the now relatively well-known application for $m = 1$, is 
provided in Andrich (1978b) and therefore formal aspects of the analysis of 
data will not be presented here.† 

Figure 3 Theoretical Category Characteristic Curves for Four Thresholds 

Further implications of the model are discussed in the next section in 
which explicit connections to the Thurstone and Likert approaches to 
attitude measurement are made. Before proceeding to those connections, it 
is significant to note that this model goes beyond the Likert-style 
questionnaire situation. The model may be entertained for any rating 
situation, which is very common in the social and biological sciences. 
These considerations have been explained for a contingency table context 
in which the dependent variable is a rated variable (Andrich, 1979). 



To demonstrate the full level of unification that the rating model ap- 
proach brings to the Thurstone and Likert perspectives and practice, a 
brief comparison of their characteristics is summarized first. 

Comparisons and Contrasts between the Thurstone and 
Likert Traditions 

A comparison of the Thurstone and Likert approaches reveals an 
interesting contrast, despite their virtually simultaneous development. The 
Thurstone approach is characterized by (i) providing statement scale 
values, (ii) being time consuming, requiring a judgment group to obtain 
the scale values and requiring that scale values be independent of the 
attitudes of the judgment group, (iii) being statistically rigorous in 
establishing a continuum, and (iv) being somewhat ad hoc in obtaining 
person scale values which, asymmetrically, do depend on the distribution of 
statement scale values. The Likert approach is characterized by (i) not 
providing statement scale values, (ii) being relatively simple to apply, not 
requiring a judgment group and making no mention of any independence 
of attitude values of groups, (iii) not having a statistical model and 
therefore not having statistical rigour in establishing a continuum, and 
(iv) being direct in obtaining person values. 

† A Fortran IV computer program which analyses data according to this model is 
available, at a nominal cost, from the Measurement and Statistics Laboratory, Department 
of Education, The University of Western Australia, Nedlands, Western Australia, 6009. 
An ERDC grant to the author and G. A. Douglas (principal investigators) supported the 
development of the program and is gratefully acknowledged. 

Spanning these Traditions with the Rating Model 

Now consider the Rasch rating model in relation to these issues. First, for 
the purposes of person measurement, statements may be responded to in 
either the Thurstone tradition of rejection or endorsement, or the Likert 
tradition of degrees of rejection or endorsement, where the latter is seen 
simply as an extension of the former. 

Secondly, and as a consequence of the model and not for reasons of 
either conceptual or numerical approximation as in the Likert tradition, 
the successive categories are scored with successive integers where the 
first category is scored by zero. Thus in the Thurstone case of a response 
of rejection or endorsement, the scores are 0 and 1 respectively, while in 
the Likert case these are extended to 0, 1, 2, 3, 4 in correspondence to the 
extended responses. Furthermore, although the score must be trans- 
formed to place a person's attitude on a rational scale, because the total 
score of a person across statements is a sufficient statistic for the attitude, 
the first stage of summarizing the responses is the same as in the Likert 
approach. 

Thirdly, statement scale values which are independent of the attitudes 
of the person as required in the Thurstone tradition, are estimated. The 
advantage of scale values for statements even for Likert style formats is 
that they help define the continuum in a more tangible way. How this 
feature is exploited is shown in the next section. 

Fourthly, because the data collection design is of the Likert style, it is 
simple and does not require the time-consuming involvement of a 
judgment-group. 

Fifthly, the statement scale values do not have to be equally spaced on 
the continuum because the attitude estimate is independent of the scale 
values of the statements. A more or less judicious choice of statements 
with respect to their values can be made with a view to minimizing the 
error of measurement, just as in choosing items of appropriate difficulty 
in achievement testing (Wright and Douglas, 1977). 

Sixthly, and as in the original Likert approaches with respect to the 
weighting of categories, thresholds between categories are estimated. 
However, the threshold values do not need to be equally spaced to justify 
integer scoring. 

Finally, while employing the simple Likert-style data collection design, 
exact probability statements as with the Thurstone models can be made 
regarding fit of data to the model. 

From the above list, it should be transparent that not only does the 
Rasch rating model synthesize and account for the apparent differences 
between the two traditional approaches to attitude measurement, but 
that in doing so, it retains the best theoretical and practical features of 
both. Most importantly, it does this simply and elegantly. 

THE RASCH TRADITION 
The above list of points deals with the most obvious relationships among 
the Thurstone, Likert, and Rasch approaches. An interesting issue to 
consider is that the rating model, which has so many characteristics con- 
sistent with theory and practice of the two traditional approaches, even 
where these appear in conflict (Ferguson, 1941), was not motivated with 
any explicit intention to reconcile the two approaches. None of the three 
papers which are most directly concerned with the evolution of the model 
(Rasch, 1968; Andersen, 1977; Andrich, 1978a) deals with real or 
simulated data or makes reference to Thurstone or Likert. This section 
traces the development of the rating model through these papers, stress- 
ing the importance of the emphasis on sufficient statistics, and shows that 
the independent development of the model facilitated its having proper- 
ties of the other approaches. 

Sufficiency 

Rasch's paper generalizes his SLM for dichotomous responses to the case 
of polychotomous responses and, in the first instance, the model is of $m$ 
dimensions for person and statement parameters. Briefly, if persons and 
statements are characterized in the first instance by vectors 
$\boldsymbol{\beta}_v = (\beta_{v1}, \beta_{v2}, \ldots, \beta_{vm})$ and 
$\boldsymbol{\delta}_i = (\delta_{i1}, \delta_{i2}, \ldots, \delta_{im})$ 
respectively, in which the maximum number of independent parameters is one 
less than the number of categories as in the dichotomous case†, and if the 
set of values of the discrete random variable $X$ is extended from 
$\{0, 1\}$ to $\{0, 1, \ldots, m\}$, where no meaning other than a naming 
of the categories for identification is associated with the numbers at this 
stage, then the generalization of (11) is given by 

\[ p\{X_{vi} = 0 \mid \boldsymbol{\beta}_v, \boldsymbol{\delta}_i, m\} = 1/\psi_{vi} \]
\[ p\{X_{vi} = x \mid \boldsymbol{\beta}_v, \boldsymbol{\delta}_i, m\} = \exp(\beta_{vx} - \delta_{ix})/\psi_{vi}, \quad x = 1, \ldots, m, \tag{18} \]

where $\psi_{vi} = 1 + \sum_{x=1}^{m} \exp(\beta_{vx} - \delta_{ix})$ is 
the normalizing factor. Equation (18) clearly specializes to (11) for 
$m = 1$. 

† The vector parameters are denoted by bold type face as in $\boldsymbol{\beta}_v$ and $\boldsymbol{\delta}_i$. When a specific 
element of a vector is considered, the bold type face is replaced by an extra subscript 
denoting the specific element, as in $\beta_{vx}$ and $\delta_{ix}$. A unidimensional or scalar parameter is 
recognized by having neither bold type face nor an element subscript, as in $\beta_v$ and $\delta_i$. The 
term 'dimension' is not used strictly in a traditional test theory sense. It simply refers to the 
number of independent parameters ascribed to the persons and items. 

The rationale for developing this generalization for polychotomous 
responses is based again on the requirement that the parameters be 
separable and both Rasch (1968) and Andersen (1977) demonstrate the 
model's uniqueness in satisfying the requirement. Given that the 
maximum number of independent parameters for each person and each 
statement is $m$, the question arises as to the possibility of reducing the 
number of parameters. Rasch (1968) provides the equations in which the 
$m$ person and statement parameters are reducible to any number less than 
$m$. The particular one of interest here is when that number is 1, in which 
case the model becomes unidimensional and the categories reflect an 
ordering. 

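When the vectors $\boldsymbol{\beta}_v$ and $\boldsymbol{\delta}_i$ can be 
expressed as linear functions of a single parameter $\beta_v$ and 
$\delta_i$ respectively, for example according to 

\[ \beta_{vx} = \nu_x + \phi_x \beta_v \quad \text{and} \quad -\delta_{ix} = \omega_x - \phi_x \delta_i, \]

(18) reduces to the form 

\[ p\{X_{vi} = x \mid \beta_v, \delta_i, \boldsymbol{\kappa}, \boldsymbol{\phi}, m\} = \exp\{\kappa_x + \phi_x(\beta_v - \delta_i)\}/\gamma_{vi} \tag{19} \]

where $\kappa_x = \nu_x + \omega_x$ 
and where $\gamma_{vi} = \gamma(\beta_v, \delta_i, \boldsymbol{\kappa}, \boldsymbol{\phi}) = \sum_{k=0}^{m} \exp\{\kappa_k + \phi_k(\beta_v - \delta_i)\}$. 

As in (17), the $\kappa$ are the category coefficients while the $\phi$ are 
the scoring functions, where $\kappa_0 = \phi_0 = 0$. The relationships 
between the $\kappa$'s and $\phi$'s in (17) and (19) are explained shortly. 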

For analysing data according to (19), the techniques developed involve 
first estimating the $m$-dimensional statement parameters and then 
factoring these parameters into the linear form given above (Andersen, 
1973a; Spada and Fischer, 1973; Allerup and Sorber, 1977). If a likelihood 
ratio test shows that the $m$-dimensional and factored forms are not 
significantly different, then the hypothesis of unidimensionality is 
accepted. Once the $\kappa_x$ and $\phi_x$ have been estimated, the 
$\beta_v$ and $\delta_i$ can be estimated unconditionally by a 
generalization from the dichotomous case (Wright and Panchapakesan, 1969; 
Andersen and Madsen, 1977). 

There are, however, two interesting and related characteristics of (19) 
which require comment. First, while the multidimensional parameters 
$\boldsymbol{\delta}_i$ can be estimated independently of the person 
parameters $\boldsymbol{\beta}_v$, the parameters $\phi_x$ and $\delta_i$ 
cannot be estimated independently of each other. Secondly, even with 
known $\phi_x$ and $\delta_i$, in general the estimate of $\beta_v$ for 
person $v$ permits a complete recovery of his pattern of responses across 
the categories. Therefore, in a well-defined sense, there is no data 
reduction in the process of estimating a person's parameter. This implies 
some kind of pseudo-estimate of a unidimensional parameter which in reality 
is still multidimensional. 

In relation to making the latter observation, Andersen (1977) 
investigated the model in (19) and established that if there exists a 
sufficient statistic for a unidimensional parameter $\beta_v$ which is a 
function of data only, then the differences $\phi_{x+1} - \phi_x$ for all 
$x < m$ must be equal. Following on from Andersen's work, I provided 
(Andrich, 1978a; 1979) an interpretation of the category coefficients and 
the scoring functions essentially in the form presented in a previous 
section, except that the response process at each threshold 
$k = 1, \ldots, m$ was initially parameterised to have possibly a different 
discrimination $\alpha_k$. The Birnbaum (1968) response model at each 
threshold, rather than the Rasch SLM of equation (11), in the form 

\[ p\{y_k = 1 \mid \beta_v, \delta_i, \tau_k, \alpha_k\} = \exp[\alpha_k\{\beta_v - (\delta_i + \tau_k)\}]/\psi_{vik} \tag{20} \]

where $\psi_{vik} = 1 + \exp[\alpha_k\{\beta_v - (\delta_i + \tau_k)\}]$, 
was postulated. Applying the rationale presented earlier to this model 
gives equation (19) in which $\kappa_0 = \phi_0 = 0$, 
$\kappa_x = -\sum_{k=1}^{x} \alpha_k \tau_k$, and 
$\phi_x = \sum_{k=1}^{x} \alpha_k$, $x = 1, 2, \ldots, m$. Then if the 
discriminations $\alpha_k$ are the same at each threshold, this value can 
be absorbed into the other parameters with the result that 
$\kappa_x = -\sum_{k=1}^{x} \tau_k$ and $\phi_x = x$, giving (17). 
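The relationship between (20), (19) and (17) can be traced in a few lines. The sketch below uses hypothetical function names and arbitrary parameter values. It normalizes the products of threshold probabilities over the compatible patterns (the common denominator, the product of the $\psi$ terms, cancels and is omitted), and compares the result with (19) evaluated at $\kappa_x = -\sum \alpha_k \tau_k$ and $\phi_x = \sum \alpha_k$.

```python
import math


def probs_from_birnbaum_thresholds(beta, delta, taus, alphas):
    """Normalise over compatible patterns when each threshold follows (20)."""
    weights = []
    for x in range(len(taus) + 1):
        # Numerator for the pattern in which the first x thresholds are exceeded.
        exponent = sum(a * (beta - (delta + t))
                       for a, t in zip(alphas[:x], taus[:x]))
        weights.append(math.exp(exponent))
    total = sum(weights)
    return [w / total for w in weights]


def coefficients(taus, alphas):
    """Category coefficients kappa_x and scoring functions phi_x of (19)."""
    kappas, phis = [0.0], [0.0]
    for a, t in zip(alphas, taus):
        kappas.append(kappas[-1] - a * t)
        phis.append(phis[-1] + a)
    return kappas, phis


def probs_eq19(beta, delta, kappas, phis):
    terms = [math.exp(k + p * (beta - delta)) for k, p in zip(kappas, phis)]
    gamma = sum(terms)
    return [t / gamma for t in terms]


if __name__ == "__main__":
    taus, alphas = [-0.8, 0.1, 0.9], [0.7, 1.3, 1.0]  # arbitrary values
    kappas, phis = coefficients(taus, alphas)
    print(probs_from_birnbaum_thresholds(0.5, -0.1, taus, alphas))
    print(probs_eq19(0.5, -0.1, kappas, phis))
    # With equal (unit) discriminations the scoring functions are integers.
    print(coefficients(taus, [1.0, 1.0, 1.0])[1])
```

With unequal discriminations the scoring functions are no longer successive integers, so the simple total score is not available as a sufficient statistic.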

Given that the model (19) is generated by a model different from the 
SLM at each threshold, it is not surprising that it does not subscribe fully 
to the requirement of having a sufficient statistic for $\beta_v$. However, 
it is stressed that the derivation of (17) was not made through a 
specialization of the Birnbaum model (20). Instead (17), or its algebraic 
equivalent, was first developed by Andersen without interpretation of the 
parameters $\kappa$ and $\phi$ given here. In addition, I derived model 
(17) before realizing its connection with models (19) and (20) in terms of 
discriminations at thresholds. In the presentation of the model (17) 
(Andrich, 1978a), I derived model (19) first and then specialized it to 
(17), but this was for purposes of efficient exposition. 

In relation to the development of (19), it is stressed that the search for 
a sufficient statistic in the Rasch approach is directed primarily by the 
perspective of eliminating parameters. While this has particular 
implications for estimating parameters, it contrasts with the usual approach 
to ordered categories, exemplified by Samejima (1969), in which the 
emphasis is on estimation. With only this latter emphasis, the development 
leading to the rating model may have stopped with algorithms for 
estimating the scoring functions $\phi_x$ (Andersen, 1973a; Spada and 
Fischer, 1973). (The substantive interpretations of the scoring functions 
$\phi$ in Andersen, and Spada and Fischer are quite consistent with the 
idea of a discrimination at each threshold. However, no similar prior 
interpretation of the category coefficients $\kappa$ seems to have been 
made.) 

A consequence of this approach, in relation to Samejima's, in
which the cumulative probability generalization is used as in (10), is
interesting to consider. Although the emphasis in the latter approach is on
estimation, no simple explicit expression for the response in each 
category follows and the probability of response in each category is the 
difference of adjacent cumulative probabilities. Therefore no simple 
sufficient statistic for estimation of parameters follows. The surprising 
consequence is that the apparently more straightforward generalization 
of the dichotomous to the ordered category model, which was used more 
or less formally by both Thurstone and Likert, does not provide the com- 
prehensive synthesis of those approaches as does the independently 
derived Rasch generalization of the dichotomous model. This feature is 
not simply a result of the algebraic formulation because the Rasch rating 
model and the Samejima model cannot be transformed into each other. 
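
The contrast between the two generalizations can be made concrete with a
small numerical sketch. The Python below is illustrative only: the function
names, the symbol `lam` for the person-minus-statement location, and the use
of a logistic cumulative model with unit discrimination as a stand-in for the
Samejima approach are assumptions made here, not the formulations of the
original papers. It shows that the rating model gives each category
probability explicitly, whereas the cumulative model obtains it only as a
difference of adjacent cumulative probabilities.

```python
import math

def rating_model_probs(lam, taus):
    """Rating model: p_x is proportional to exp(kappa_x + x * lam),
    where kappa_0 = 0 and kappa_x = -(tau_1 + ... + tau_x)."""
    kappas = [0.0]
    for tau in taus:
        kappas.append(kappas[-1] - tau)
    weights = [math.exp(kappa + x * lam) for x, kappa in enumerate(kappas)]
    gamma = sum(weights)  # normalizing constant
    return [w / gamma for w in weights]

def cumulative_model_probs(lam, cuts):
    """Cumulative model (assumed logistic, unit discrimination):
    P(X >= x) = logistic(lam - cut_x); each category probability is the
    difference of two adjacent cumulative probabilities."""
    def logistic(z):
        return 1.0 / (1.0 + math.exp(-z))
    cumulative = [1.0] + [logistic(lam - c) for c in cuts] + [0.0]
    return [cumulative[x] - cumulative[x + 1] for x in range(len(cuts) + 1)]

taus = [-1.20, -0.40, 0.40, 1.20]   # threshold values chosen for illustration
lam = 0.3                           # hypothetical value of beta - delta
p = rating_model_probs(lam, taus)
q = cumulative_model_probs(lam, taus)

# In the rating model the log-odds of adjacent categories is lam - tau_x.
adjacent_logits = [math.log(p[x] / p[x - 1]) for x in range(1, len(p))]
# In the cumulative model the log-odds of P(X >= x) is lam - cut_x.
cumulative_logits = [math.log(sum(q[x:]) / sum(q[:x])) for x in range(1, len(q))]
```

The same numbers play the role of thresholds in both functions, yet the two
sets of category probabilities differ, because the logistic transform is
applied to adjacent categories in the one case and to accumulated categories
in the other.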

The other important contrast of approaches is that they generate prob- 
lems which are unique to each approach. In the Rasch approach, estima- 
tion ideally is carried out through conditional distributions (Andersen, 
1973a, 1973b) and even though the models are simple and estimates have 
desirable properties, the implementation of algorithms for solving the
resultant equations can become complex. 

In attempting to find approximations to the estimation which may be
more efficient, the consistency of the estimates is always a concern
because of the demonstration by Andersen (1973a) that unconditional
estimates, that is, joint estimates of the statement and person
parameters, in the dichotomous case are not consistent. (For a fixed
number of statements, as the number of persons increases without limit,
the parameter estimates converge to values which are not the actual
parameter values.) In the Samejima approach, while the expression for
the probability of each category is more complex, the estimation is
carried out by the more straightforward unconditional approach. In such
approaches actual convergence of estimates in iterative algorithms for
implicit equations may sometimes be at issue, but the actual consistency
of estimates in the sense of Rasch is not as explicit a concern. Thus the
numerical methods problems are distinctly different.

Parameter Elimination and the Terms 'Population-free'
and 'Sample-free'

The focus of Rasch on parameter elimination, together with some
consequent issues, has just been described. The possibility of eliminating
parameters explicitly is generally considered a desirable property for
psychometric models and the terms 'population-free', 'sample-free',
'person-free', and 'item-free' have been coined with respect to Rasch
models.

However, these terms are not always fully appreciated; sometimes they 
are taken to imply more and sometimes less than what they actually 
mean. The confusions often stem from the dual uses of both the terms
'population' and 'sample', and the latter especially in relation to the idea
of sampling distributions.

One use of population is associated with the specification of a class of 
objects or people as, say, in the population of 15-year-olds in Australian 
high schools. The other use is with respect to numbers associated with 
each of the members of the class with respect to some variable. For 
example, it might be said that the numbers indicating the degree of 
achievement on some test are normally distributed. In relation to the lat- 
ter use, random sampling has the virtue that distributional properties of 
random samples are well specified, hence the common use of the random 
sample to represent some population. The confusion, of course, readily 
arises because to get the random sample of numbers, one selects the 
members of the class at random. However, conceptually, these are 
different and it is with respect to the numbers associated with people and 
their distribution and not with respect to the specification of the class of 
people, that the Rasch models are population-free. They are free of 
distributional populations, not of classes of people. Because any member 
of a population as a class can be selected, there is no need to invoke
random sampling and its consequent distributional properties to check on
the structure of the scale and the quality of measurement. In this sense
the models are also sample-free.

In general, both a class of statements and a class of persons are en- 
visaged when some attempt at measurement is made. To check the con- 
formity of the statements and person parameters, various tests of fit can 
be applied. Both person-fit and statement-fit (Wright and Stone, 1979; 
Wright and Mead, 1977; Mead, 1976) can be examined in order to isolate 
and understand why members of the person and statement classes, con- 
sidered on a priori grounds to belong to the same class, do not conform 
with respect to the measurement procedure. Within a conformable class,
any members can be selected, and in that sense the models are person- 
free and statement-free, but the classes have to be defined and confirmed 
to be conformable. 

In addition, by examining different classes of statements and persons,
for example from two scales devised separately but on the same issue, or 
two classes of persons such as all 14-year-olds and all 15-year-olds in a 
school system, and checking if they conform with each other, the 
generality of the variables and scales can be extended. 

If, after the generality of a scale is demonstrated across different
sub-classes of people, one wished to compare two sub-classes such as all
14-year-olds and all 15-year-olds with respect to location and dispersion
on the trait, then a random sample from each sub-class ought to be
selected. But then the aim is to describe a distribution of a population of
numbers with respect to some class, not to confirm structural properties
of the class. This point also demonstrates that it is only the relative state- 
ment scale values that are objective, and not the absolute values. 

In this connection, it might be stressed that the 'distribution-free'
property of the Rasch models, to use another more general term, is a
property of the models, not of data. That is, to demonstrate
distribution-free properties of models, one only needs to consider the
models. Only if real data conform to the models can the corresponding
characteristics be applied to the data. A check of whether data accord with
the model can involve checking that the statement or person scale values
are distribution-free.
When the relationships among statements are not distribution-free or 
attitude-free with respect to two classes of people, it may be just as infor- 
mative as when they are, because then a potentially significant difference 
between the classes of people has been exposed. In this sense, the infor- 
mation is analogous to that of Thurstone in the pair-comparison ap- 
proach in which populations are described, not by their respective means 
and variances, but by the scales they generate. 

Another Connection to the Thurstone Tradition 

It was observed earlier that Thurstone realized the significance of
invariance of relative statement-values across persons with different
attitude values. However, he never formalized this feature in his models.
Therefore it is opportune to note here, and as shown in detail elsewhere
(Andrich, 1978c), that by formalizing Thurstone's own verbal statements
regarding the discriminal process of equation (1), and by rearranging the
error term to separate the among-person variance from the within-person 
variance, the law of comparative judgment applied to the pair- 
comparison design does eliminate the person parameters. Thus 
distribution-free statement-scaling is met by the pair-comparison design. 
This effect is manifested by the correlation $\rho_{ij}$ in (4) being zero, and
although in his data analyses Thurstone consistently assumed $\rho_{ij} = 0$ (cf.
Thurstone, 1927b, 1927c) and although in an empirical study
(Thurstone, 1931) he calculated that the observed correlation was indeed
virtually zero, it does not appear that he ever related this assumption and
evidence to the elimination of person scale-values. Thurstone's
specifications must be modified to realize it, but his requirement of random
sampling of persons is in fact not necessary. Interestingly, it seems that it
is difficult to motivate this particular modification unless one has the
perspective of explicit person parameter elimination which is so central
to the Rasch tradition. That is, to appreciate this characteristic in
Thurstone's model, one must look at it from a Rasch perspective.

Not only are the person parameters eliminated in the pair-comparison
design but, if the logistic distribution is substituted for the normal and
the discriminations $\alpha_i$ are assumed to be the same for all statements,
which is Thurstone's Case V specialization of (4), then the statement
scale values for the pair-comparison design and the direct-response
design (whether dichotomous or polychotomous as in the Likert format)
will be the same according to the models. Analogously, when statements
are categorized as in the equal-appearing-intervals design, then each
statement can be considered to have been rated. Accordingly the rating
model again can be applied. Thus the equal-appearing-intervals design,
the pair-comparison design, and the direct-response design all provide,
according to the models, the same statement scale values, and all are
free of any attitude of the persons involved in the
data collection. Whether or not sets of data show these properties is an 
empirical question, but to the degree that they do, then to that degree 
generality of relationships is demonstrated. 

Another Connection to the Likert Tradition 

The estimation of parameters with sufficient statistics complements the
elimination of such parameters. The issue of estimating the attitude $\beta_v$
for person $v$ on a rational scale by transforming the total score
$r_v = \sum_i x_{vi}$ is taken up again here.

If the statement and threshold values are assumed accurately
estimated, then the direct maximum likelihood equation

$$r_v = \sum_i \sum_x x \exp[\kappa_x + x(\beta_v - \delta_i)]/\gamma_{vi} \qquad (21)$$

can be used to estimate $\beta_v$ (Andersen and Madsen, 1977; Andrich, 1978b)
and the associated error variance of $\hat{\beta}_v$ is approximated by

$$\sigma^2_{\hat{\beta}_v} \approx 1 \Big/ \sum_i \Big[\sum_x x^2 p_{xvi} - \Big(\sum_x x\, p_{xvi}\Big)^2\Big] \qquad (22)$$

where $p_{xvi}$ is given by (17).
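
As a hedged illustration of how equations (21) and (22) might be used in
practice, the following Python sketch solves (21) for the attitude by
Newton-Raphson iteration and then evaluates (22). The function names, the
step limit, the statement values `deltas`, and the thresholds behind `kappas`
are all hypothetical choices made for this example; they are not the values
or the algorithm of the analyses cited above.

```python
import math

def category_probs(beta, delta, kappas):
    """Category probabilities exp[kappa_x + x(beta - delta)] / gamma."""
    weights = [math.exp(k + x * (beta - delta)) for x, k in enumerate(kappas)]
    gamma = sum(weights)
    return [w / gamma for w in weights]

def expected_score_and_information(beta, deltas, kappas):
    """Right-hand side of (21) and the bracketed sum in (22)."""
    expected, information = 0.0, 0.0
    for delta in deltas:
        p = category_probs(beta, delta, kappas)
        mean = sum(x * px for x, px in enumerate(p))
        second_moment = sum(x * x * px for x, px in enumerate(p))
        expected += mean
        information += second_moment - mean * mean
    return expected, information

def estimate_attitude(r, deltas, kappas, tol=1e-10, max_iter=100):
    """Solve (21) for beta given total score r; return (beta, variance)."""
    max_score = (len(kappas) - 1) * len(deltas)
    if not 0 < r < max_score:
        raise ValueError("no finite estimate for a zero or maximum score")
    beta = 0.0
    for _ in range(max_iter):
        expected, information = expected_score_and_information(beta, deltas, kappas)
        # The derivative of the expected score with respect to beta is the
        # information, so this is a Newton-Raphson step, limited in size
        # to keep the iteration stable for extreme scores.
        step = max(-1.0, min(1.0, (r - expected) / information))
        beta += step
        if abs(step) < tol:
            break
    _, information = expected_score_and_information(beta, deltas, kappas)
    return beta, 1.0 / information

# Hypothetical example: 16 statements, four categories scored 0 to 3,
# thresholds (-1, 0, 1) so that kappa = (0, 1, 1, 0).
deltas = [-1.5 + 0.2 * i for i in range(16)]
kappas = [0.0, 1.0, 1.0, 0.0]
score_table = {r: estimate_attitude(r, deltas, kappas) for r in range(1, 48)}
```

Here `score_table` maps every non-extreme total score to an attitude estimate
and its approximate error variance, which is the kind of score-to-estimate
transformation the text goes on to discuss.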
Table 1 Transformation of Total Scores to Attitude Estimates for a
Conformable Set of 16 Likert-style Statements without the Undecided
Category
Clearly all persons with the same total score will have the same attitude
estimate. In addition, the transformation of total scores $r_v$ to attitude
estimates is monotonic. Therefore, if there is reasonable variation
among total scores of a set of persons, the total scores will correlate
closely with the attitude estimates. A transformation of total scores to
attitude estimates for the example of 16 Likert-type statements with four
ordered categories described in Andrich (1978b) is shown in Table 1. The
same relationship is portrayed graphically in Figure 4. It is apparent
from this figure that there is a wide range, approximately 5 to 40, in
which the scores $r_v$ and the estimates $\hat{\beta}_v$ are virtually linearly
related. In the real data
approximately 95 per cent of the 284 persons were in the range from 17 to 
40. This type of relationship perhaps helps explain the success of Likert's 
integer scoring and, in an interesting sense, renders unnecessary the con- 
cern of people who assumed that integer scoring depended on equal 
distances between thresholds bordering the categories. 

Figure 4 Relationship of Total Scores and Attitude Estimates for a
Conformable Set of 16 Likert-style Statements without the
Undecided Category (vertical axis: total score r; the plot marks
the limits of the central range of the 284 person responses)

However, an interesting question which might be asked is: Why bother
with the transformations if the total scores will suffice? Related questions
are: What effect, if any, do the affective values have if the total score is all
that is needed, and does it not make any difference which statements are
endorsed or rejected?

Answers to all of these questions help explain further the aspects of
the rating model and how more can be gained by using it than can be
obtained by simply using the total score. First, while the monotonic
relationship between $r_v$ and $\hat{\beta}_v$ is a straightforward algebraic
relationship, the meaning of the scores and estimates is only valid if the
responses accord with the model. As already mentioned, explicit checks of
person-fit and item-fit are available to help define the class of statements
and persons which are actually conformable and to indicate which
statements and persons need special consideration. In the case of
statements, the special consideration may be related to their wording or
the like. Presumably,
when these statements are constructed and placed in the questionnaire, it 
is expected that they conform with the other statements. To the degree 
Table 2 Response Patterns of Three Persons with the Same Score to
Ten Statements

              Statements (i), in order of increasing affective value
Persons (v)    1   2   3   4   5   6   7   8   9  10    Total
     1         3   3   2   2   1   2   0   1   0   0      14
     2         3   2   3   2   1   2   1   0   0   0      14
     3         0   3   0   1   2   3   0   2   0   3      14


that they do not, to that degree the theory or principle on which the 
statements are generated is not mastered. Understanding the source of 
statement-misfit can therefore further help clarify the attitude variable. 

The case of misfitting persons indicates that they were not measured as 
intended. That is, their total score is not 'sufficient' to account for a single 
attitude and they are not comparable with other persons on the same
scale. Such persons too can contribute to the refining of the variable. 

Requiring conformity to the model means that only certain patterns
with a specified amount of random variation are permissible. If two
people have the same total score, they will also have a similar pattern of
responses. The total score is sufficient for the parameter estimate and no
further information for the parameter can be gleaned from the response
pattern, but the pattern can be used for checking the fit. For example,
consider a four-category response case of ten statements with increasing
affective values from left to right. The response patterns of three people
each with the score of 14 are shown in Table 2.

Persons 1 and 2 could readily be conformable and it would be easy to
believe that the slight difference in pattern is due to random fluctuation.
However, for a total score of 14, the response pattern of person 3 would
be considered odd. This person endorses statements of greater or lesser
affective values about equally. Such a response pattern would be
diagnosed as misfitting the model. Thus the weighting of statements in
terms of their scale values plays an important role in recognizing unusual
patterns and in confirming that a total score can be used in equation (21)
to estimate $\beta_v$.†
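
The way the statement scale values expose an unusual pattern can be sketched
numerically. Under the rating model, the probability of a response pattern
given its total score is proportional to the exponential of the sum over
statements of the category coefficient minus the score times the statement
value, and the person parameter cancels. The Python below applies this to
the three patterns of Table 2. The statement values `deltas` and thresholds
`taus` are hypothetical, chosen only for illustration, and the function name
is likewise an assumption; this is not one of the published fit statistics.

```python
def pattern_log_kernel(pattern, deltas, taus):
    """Return the sum over statements of (kappa_x - x * delta_i).

    Given the total score, the rating model probability of a pattern is
    proportional to exp(kernel); the person parameter does not appear."""
    kappas = [0.0]
    for tau in taus:
        kappas.append(kappas[-1] - tau)
    return sum(kappas[x] - x * d for x, d in zip(pattern, deltas))

# Response patterns of Table 2 (categories scored 0 to 3, total score 14).
patterns = {
    1: [3, 3, 2, 2, 1, 2, 0, 1, 0, 0],
    2: [3, 2, 3, 2, 1, 2, 1, 0, 0, 0],
    3: [0, 3, 0, 1, 2, 3, 0, 2, 0, 3],
}
# Hypothetical values: equally spaced affective values, symmetric thresholds.
deltas = [-2.25 + 0.5 * i for i in range(10)]
taus = [-1.0, 0.0, 1.0]

kernels = {v: pattern_log_kernel(x, deltas, taus) for v, x in patterns.items()}
# Natural log of how much more probable each pattern is than that of person 3.
log_ratio_vs_person3 = {v: k - kernels[3] for v, k in kernels.items()}
```

With these assumed values the patterns of persons 1 and 2 are equally
probable given the score of 14, while the pattern of person 3 is less
probable by 20 units on the log scale, which is the sense in which it would
be flagged as misfitting.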

† For a contrasting view on this issue, based on a non-Rasch perspective on sufficiency,
see Samejima (1969), Chapter 10, entitled 'Some observations concerning the relationship
between formulas for the item characteristic function and the philosophy of scoring'.

Another situation where the effects of statement scale values can be
seen is when all persons do not respond to the same statements. Since the
person scale values should be independent of the statement values, any
sub-set of statements from a conformable set should give statistically
equivalent person measures. This is particularly useful in constructing
'parallel forms' of questionnaires when repeated measurements are
required and the same statements are avoided to reduce the effects of
memory of specific responses to the same questions. But, in this case, the
same total score from the two forms will not, in general, correspond to the
same attitude estimate. For example, if the first form happened to have
statements with somewhat higher affective values than the second form,
then a total score on the first form would correspond to a greater attitude
value than the same score on the second form.
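
A minimal numerical sketch of this point follows. It assumes two
hypothetical eight-statement forms whose affective values differ by a
constant 0.5, with hypothetical thresholds, and it solves for the attitude
by simple bisection (a choice made here for brevity, not a method taken from
the cited papers). The function and variable names are illustrative.

```python
import math

def expected_score(beta, deltas, taus):
    """Expected total score under the rating model for attitude beta."""
    kappas = [0.0]
    for tau in taus:
        kappas.append(kappas[-1] - tau)
    total = 0.0
    for delta in deltas:
        weights = [math.exp(k + x * (beta - delta)) for x, k in enumerate(kappas)]
        total += sum(x * w for x, w in enumerate(weights)) / sum(weights)
    return total

def attitude_for_score(r, deltas, taus, low=-15.0, high=15.0):
    """Find beta with expected_score(beta) = r by bisection. The expected
    score increases with beta, so the root is unique for a non-extreme r."""
    for _ in range(100):
        mid = (low + high) / 2.0
        if expected_score(mid, deltas, taus) < r:
            low = mid
        else:
            high = mid
    return (low + high) / 2.0

taus = [-1.0, 0.0, 1.0]                        # hypothetical thresholds
form_b = [-0.7 + 0.2 * i for i in range(8)]    # hypothetical affective values
form_a = [d + 0.5 for d in form_b]             # first form: higher values
score = 14                                     # same total score on both forms
beta_a = attitude_for_score(score, form_a, taus)
beta_b = attitude_for_score(score, form_b, taus)
```

Because the model depends only on the difference between attitude and
affective value, the same score of 14 corresponds to an attitude estimate on
the first form that is 0.5 greater than on the second.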

The Rating Model and the Undecided Category 

It was indicated earlier that two concerns with the integer scoring of suc- 
cessive categories still persist. These are the equidistance of intervals or 
distances between categories in general, and the operating characteristics 
of the middle or Undecided category in particular. A justification of in- 
teger scoring through the rating model, without reference to distances 
between categories, has already been made. The operating characteristics 
of the middle category, and its manifestations in the rating model, are 
now examined. 

First consider what might be expected. If the middle category operates
consistently with the other categories, then the probability of response in
the category for any statement should show an appropriate transition
across categories as a function of $\lambda_{vi} = \beta_v - \delta_i$. In particular,
the middle category should neither be over-represented nor
under-represented.

The category characteristic curves in Figure 3 reflect such conditions.
Explicit probabilities for a 5-category response format from, say, Strongly
Approve (SA) to Strongly Disapprove (SD), which conform to this
pattern, are portrayed in Figure 5 for three values of $\lambda_{vi} = \beta_v - \delta_i$
and for threshold values of $\tau_1 = -1.20$, $\tau_2 = -0.40$, $\tau_3 = 0.40$,
$\tau_4 = 1.20$.

Because the response depends on $\beta_v - \delta_i$, the three graphs from left to
right could represent the response patterns of a single person to three 
statements of decreasing affective values or of three persons of increasing 
attitude to a single statement. To illustrate what tends to happen when 
data including the Undecided category are analysed according to the 
rating model, some results from a real data set are now described briefly. 
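
The probabilities portrayed in Figure 5 can be recomputed from the threshold
values quoted above. The Python sketch below does so for the three values
-1.4, 0.0, and 1.4 of the person-minus-statement location. The function
name, the symbol `lam`, and the scoring of the categories from 0 (SD) to
4 (SA) are assumptions made for this illustration.

```python
import math

def rating_model_probs(lam, taus):
    """Rating model probabilities for categories 0..m at location lam,
    with kappa_0 = 0 and kappa_x = -(tau_1 + ... + tau_x)."""
    kappas = [0.0]
    for tau in taus:
        kappas.append(kappas[-1] - tau)
    weights = [math.exp(kappa + x * lam) for x, kappa in enumerate(kappas)]
    gamma = sum(weights)
    return [w / gamma for w in weights]

taus = [-1.20, -0.40, 0.40, 1.20]        # ordered thresholds from the text
labels = ["SD", "D", "U", "A", "SA"]     # assumed scoring 0, 1, 2, 3, 4

for lam in (-1.4, 0.0, 1.4):
    probs = rating_model_probs(lam, taus)
    row = "  ".join(f"{c}={p:.3f}" for c, p in zip(labels, probs))
    print(f"lambda = {lam:+.1f}:  {row}")
```

For the central value the Undecided category is the most probable response,
and the distributions for -1.4 and 1.4 are mirror images of each other, as
expected when the thresholds are ordered and symmetric.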

These data, which involve the responses of 309 Year 5 school children
in Australia who answered 16 statements called 'questions about school'
(Western Australia, Education Department, 1974), have been analysed
according to the case of the rating model (Andrich, 1978d) having
binomial coefficients.

Figure 5 Model Response Probabilities in Ordered Threshold Case for
Three Persons with Increasing Attitude to a Single Statement
or of a Single Person to Three Statements with Decreasing
Affective Values ($\lambda_{vi} = \beta_v - \delta_i$; $\tau_1 = -1.20$, $\tau_2 = -0.40$,
$\tau_3 = 0.40$, $\tau_4 = 1.20$). Panels: (a) $\lambda_{vi} = -1.4$, (b) $\lambda_{vi} = 0.0$,
(c) $\lambda_{vi} = 1.4$; response categories SD, D, U, A, SA.

An obvious feature of the threshold estimates, shown in Table 3, is
that they are not ordered as expected, in particular $\tau_3 < \tau_2$. With these
values of the thresholds, distributions analogous to those in Figure 5,
and with similar values of $\lambda_{vi} = \beta_v - \delta_i$, are displayed in Figure 6. It
is evident that the distribution for a central value of $\lambda_{vi}$ is bimodal. A
general principle can be inferred from this illustration, namely, that only
if the thresholds are ordered is the rating model distribution strictly
unimodal. Ordered thresholds ensure that the coefficients $\kappa_x$ have the
relationship

$$\kappa_x > \frac{\kappa_{x-1} + \kappa_{x+1}}{2} \quad \text{for all } x = 1, \ldots, m - 1$$

which in turn reflects the unimodality. The Poisson and binomial
distributions, which are special cases, have this relationship among
coefficients.

A bimodal probability distribution, which for any fixed set of
parameters $\lambda_{vi} = \beta_v - \delta_i$ is a random error distribution, seems untenable
for a unidimensional variable. In general, if an observed distribution is 
bimodal, it reflects at least two overlapping populations of numbers. 
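
The link between threshold order and unimodality can be checked
numerically. The Python sketch below compares the ordered thresholds of
Figure 5 with a hypothetical disordered set in which the third threshold is
below the second; the disordered values, the grid of locations, and the
function names are assumptions made for illustration and are not the
estimates of Table 3.

```python
import math

def kappas_from_taus(taus):
    """kappa_0 = 0 and kappa_x = -(tau_1 + ... + tau_x)."""
    kappas = [0.0]
    for tau in taus:
        kappas.append(kappas[-1] - tau)
    return kappas

def rating_model_probs(lam, taus):
    kappas = kappas_from_taus(taus)
    weights = [math.exp(kappa + x * lam) for x, kappa in enumerate(kappas)]
    gamma = sum(weights)
    return [w / gamma for w in weights]

def satisfies_midpoint_condition(taus):
    """True if kappa_x > (kappa_{x-1} + kappa_{x+1}) / 2 for x = 1..m-1."""
    k = kappas_from_taus(taus)
    return all(k[x] > (k[x - 1] + k[x + 1]) / 2.0 for x in range(1, len(k) - 1))

def count_modes(probs):
    """Count local maxima; a run of tied values counts at most once."""
    padded = [-1.0] + list(probs) + [-1.0]
    return sum(
        1
        for i in range(1, len(padded) - 1)
        if padded[i] > padded[i - 1] and padded[i] >= padded[i + 1]
    )

ordered = [-1.20, -0.40, 0.40, 1.20]     # thresholds of Figure 5
disordered = [-0.4, 0.3, -0.3, 0.4]      # hypothetical, with tau_3 < tau_2
grid = [i / 10.0 for i in range(-30, 31)]

modes_ordered = [count_modes(rating_model_probs(lam, ordered)) for lam in grid]
modes_at_centre = count_modes(rating_model_probs(0.0, disordered))
```

With the ordered thresholds every distribution on the grid has a single
mode, whereas the hypothetical disordered set gives two modes at the central
location, with the middle category less probable than both of its
neighbours.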

Table 3 Threshold Estimates from a Real Data Set of 16 Statements
Including the Undecided Category

k                       1       2       3       4
$\hat{\tau}_k$         -0.40    0.00   -0J8    0.79
SE($\hat{\tau}_k$)      0.05    0.04    0.04    0.04

Figure 6 Model Probabilities as a Function of $\lambda_{vi} = \beta_v - \delta_i$ for a Real
Data Set in which Threshold Estimates Show Disordering

A manifestation of the reversed thresholds can also be seen from the
category characteristic curves shown in Figure 7, from which it is evident
that no matter what the value of $\lambda_{vi} = \beta_v - \delta_i$, the probability of a
response in the middle category is never greater than the probability of a
response in at least one of the other categories. Indeed, if $\lambda_{vi} = 0.0$,
where effectively the person and statement values cancel each other, then
one would expect that the most likely response would be in the middle or
Undecided category. However, it is not. The response probability in the
Agree category is greater than the probability for the middle category.
Note that this has nothing to do with the distribution of people. The
distribution of people might be bimodal so that some have a high attitude
and some have a low attitude and very few have a middle attitude. But this
will not affect the category characteristic curves. The distributions in
Figures 6 and 7 pertain to a single individual.

Figure 7 Category Characteristic Curves for a Real Data Set in which
Threshold Estimates Show Disordering

Table 4 Threshold Estimates from a Real Data Set of 16 Statements
without the Undecided Category

k                       1       2       3
$\hat{\tau}_k$         -1.00   -0.02    1.02
SE($\hat{\tau}_k$)      0.05    0.04    0.04

A reversal of threshold estimates can occur if the discriminations at
thresholds are not equal but the data are analysed as if they were, that is,
if an incorrect model is applied. The general perspective taken here is
that if threshold estimates are disordered, then the data do not fit the
model. It should be noted that the fit or otherwise on this criterion does
not invoke a fit statistic with an associated probability. In any case, such
a statistic provides only a necessary, and not a sufficient condition, for
deciding the fit. The criterion of threshold order arises directly from the
specification of the model.

The question is what to do about this. An answer is available from 
psychophysics literature in which a similar problem has been en- 
countered. This is to leave the Undecided category out. In the case when 
the other choices are simply Approve or Disapprove, this category is
considered best left out (Bock and Jones, 1968, p. 3). It seems consistent to
leave it out also when the intensity of Approval or Disapproval is
extended. As a practical recommendation, in the piloting of questions the
Undecided category would be included as usual and then any statements 
which seemed to attract too many responses in this category would be 
modified or excluded. In the final version of the questionnaire, the 
Undecided category would be left out and an instruction given to people 
to make a response even if, sometimes, they were uncertain. 

The analysis of a questionnaire constructed under such principles, 
some results of which are shown in Table 1 and Figure 4, has been 
reported in Andrich (1978b). For a conformable sub-set of 16 statements 
from an original set of 20, the resultant threshold estimates are shown in 
Table 4. As can be seen, the thresholds are in the correct order and the 
distances are symmetrical. 

It must be stressed that the emphasis on the above results is not on the 
actual values but on their symmetry, and that this symmetry confirms the
suitability of the rating model for analysing such data. The threshold 
estimates were obtained by an unconditional estimation procedure and, 
therefore, the quality of the estimates in relation to consistency is not 
known. 

Another point that needs mentioning is that, interestingly enough, the 
objectivity requirements of the model are not destroyed if the natural 
threshold order is violated. Primarily on the basis of this fact, Wright 
and Masters (1980) and Masters (1980) are prepared to accept disordered 
thresholds as not violating the model. Masters also gives an extensive 
review of category characteristic curves obtained in psychophysics 
research and discusses the notion of 'response set' with respect to the 
Undecided category. 
Finally on this point, it is noted that the traditional cumulative
probability approach would not reveal so vividly the anomaly exposed above.
Because the probabilities are accumulated, and the logistic transform
applied effectively at each accumulated point, the threshold estimates
must be strictly ordered. In the Rasch model the logistic transform is
applied effectively with respect to each pair of adjacent categories, hence
the possible disclosure of disordered thresholds. The corresponding
result from the cumulative probabilities approach would be a smaller
distance between the two middle thresholds than between the other
thresholds. Although the notion of threshold is similar in the two
approaches in that in both it refers to a cut-off point on a continuum, the
thresholds are actually formally defined differently in the two approaches
so that different values are obtained from the two models.
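
A small numerical sketch may help to show why accumulated probabilities
cannot display disordering. The Python below takes category probabilities
generated by the rating model with a hypothetical disordered set of
thresholds and computes, at a single location, the cut points implied by
applying the logistic transform to the accumulated probabilities. The
threshold values, the function names, and the evaluation at one location
only are assumptions for illustration; this is not a fitted cumulative
model.

```python
import math

def rating_model_probs(lam, taus):
    """Rating model probabilities with kappa_x = -(tau_1 + ... + tau_x)."""
    kappas = [0.0]
    for tau in taus:
        kappas.append(kappas[-1] - tau)
    weights = [math.exp(kappa + x * lam) for x, kappa in enumerate(kappas)]
    gamma = sum(weights)
    return [w / gamma for w in weights]

def implied_cumulative_cuts(probs, lam):
    """Cut points c_x defined by logit P(X >= x) = lam - c_x."""
    cuts = []
    for x in range(1, len(probs)):
        upper = sum(probs[x:])    # P(X >= x)
        lower = sum(probs[:x])    # P(X < x)
        cuts.append(lam - math.log(upper / lower))
    return cuts

taus = [-0.4, 0.3, -0.3, 0.4]     # hypothetical, with tau_3 < tau_2
lam = 0.0
probs = rating_model_probs(lam, taus)
cuts = implied_cumulative_cuts(probs, lam)
gaps = [b - a for a, b in zip(cuts, cuts[1:])]
```

Since P(X >= x) can only decrease as x increases, the implied cut points are
necessarily in increasing order; in this hypothetical case the disordering
of the rating model thresholds shows up only as a smaller gap between the
two middle cut points than between the outer ones.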

Further Aspects of the Rasch Tradition 

The elegant features of the rating model, a member of the Rasch models,
do not obviate the need for patient, careful, insightful, and sometimes
laborious construction of statements. Indeed, because of its explicitly 
demanding requirements, the care required may sometimes be greater 
than in traditional approaches. The reward, however, is the generality of 
the statements constructed with respect to the variables conceptualized. 

The rating model itself, as indicated earlier, is relevant beyond the 
Likert-style questionnaire context; in a sense, that case is only an 
example of a rating system for classifying ordered data. In the social and 
biological sciences, a rating system is used usually when formal 
measurements cannot be made. Application of the rating model to such 
data can, therefore, provide a check on the quality of rating mechanism 
and help place the results of ratings on the same level as that of usual 
measurement. The only difference between rating and measurement then
becomes the degree of accuracy, which in any case can also be estimated 
from the model. In the physical sciences, the very process of measure- 
ment is used to clarify and understand variables. Modifying and im- 
proving a rating procedure with respect to some variable according to the 
rating model should help clarify variables, which cannot be ordinarily 
measured, in the same way.†

In this context, it is also worth noting that, in the physical sciences, the 
different variables involved in lawful relationships and the lawful rela- 
tionships themselves are defined simultaneously (Kuhn, 1972) and that 
these very definitions often involve measurements (Ramsay, 1975). 

† The prevalence of the rating scale in social science research is testified to by Dawes
(1972) who states that some 60 per cent of studies involve as dependent variables only rating
type variables.
In contrast, there is a tendency in psychology and education, and
particularly the latter, to construct various scales for variables in a more or
less independent way, and then following their construction, to examine 
relationships among variables using some form of correlational pro- 
cedure. Thus measurement is seen to be prior to establishing relation- 
ships among variables. However, the demands of Rasch's specific objec- 
tivity can be seen as demands for general lawful relationships (Rasch, 
1977) in which the three aspects — (i) the definition of each variable in- 
volved, (ii) the relationships among variables, and (iii) their 
measurement — are simultaneous. 

On this issue which touches some fundamental epistemological ques- 
tions, but which has only been briefly mentioned in order to indicate 
further possible developments, three final points may be made. First, the 
usual study of constructing measured variables as prior to investigating 
their relationships can still be applied using scales conforming to the 
Rasch specification, and be better because the scales do so conform.
Secondly, the emphasis on the specification of lawful relationships, with
measurement being simultaneous to it, is exemplified by papers in Spada
and Kempf (1977) and Kempf and Repp (1977) and seems to be the
direction taken in the German-speaking countries. Finally, conceptualizing
rational scale construction or measurement as part of establishing lawful
relationships again is consistent with Thurstone's conceptualization for
attitude scaling which is derived from the psychophysical framework.



A Rasch model for ordered response categories is derived and it is shown
that it retains the key features, both theoretical and practical, of both the
Thurstone and Likert approaches to studying attitude. These key
features of the latter approaches are also reviewed.

Characteristics in common with the Thurstone approach are:
statements are scaled with respect to their affective values; these values
are independent of attitudes of the persons responding; the scales are
rational in the sense that they have interval level properties; the scale
values, apart from a linear transformation, are the same as in the
pair-comparison design and the equal-appearing-interval design; the model
for the data, being an explicit probability model, provides formal tests of
fit. Characteristics in common with the Likert approach are: no
judgment group is required; persons whose attitude is to be quantified
respond to statements in the usual way in terms of intensity of Approval
or Disapproval; while thresholds or boundaries between categories are
estimated, the successive categories are scored with successive integers;
though it has to be transformed monotonically, the attitude of a person
is characterized by the simple sum of the integral scores across the set of
statements; concerns with Likert's Undecided category are appropriately
manifested in the rating model.

Further features of this model which distinguish the Rasch tradition 
are considered. Thus it is shown that the more conventional latent trait 
formalizations which were used or broached by Thurstone and Likert for 
ordered qualitative data do not provide the synthesis of the two
approaches that the rating model does, even though the latter model is
derived from a very different basis. It is shown that this basis, which
generates a completely different set of research questions for estimation
from that of the conventional approach, is characterized by identifying
sufficient statistics which can be used for eliminating one set of
parameters while estimating the others. In this context, further
connections to both the Thurstone and Likert traditions are made. Finally, the
possibility of orientating the Rasch tradition for studying relationships 
among variables to one in which measurement is seen as simultaneous to 
constructing lawful relationships among variables, rather than prior to 
examining such relationships, is broached briefly. 

REACTANT STATEMENT

Charles Poole 

As a teacher interested in measurement I would be decidedly sceptical of 
any claims that one model of measurement fitted all my various re- 
quirements better than another. Much of my measuring activity takes the 
form of dealing out rough justice to answers submitted in essay form. 
Admittedly the essay form is not chosen for measurement reasons but 
because that form of activity fits with the educational aims of the course. 
However I have found the classical model helpful in warning me of the 
weaknesses of the technique and in developing weighting schemes to im- 
plement policies I have determined for the course. The concerns of the 
Rasch model developers seem mostly far removed from these activities. 
For most such purposes the classical model will do just as well. 

Andrich has made a good start in his paper towards producing an ap- 
proach to scaling which suggests that here the Rasch model has con- 
siderable advantages over the classical model. Not only do we have a 
rationale for the Likert-type scales so beloved by designers of questionnaires,
but he has also provided some links with the Thurstone scaling
techniques. It was of great interest to me to be reminded that Thurstone
was aware of the sample-free nature of his scale values and that he
recognized the power this gives to obtain measures from incomplete
tests.

The finding that the so-called neutral category does not necessarily fall
in the centre of the scale neatly demonstrates the validity of the disquiet 
often expressed about this assumption and gives us good reason for 
removing this choice from the offered responses. We need a wide variety 
of studies like this one to make clear the usefulness of the Rasch model in 
various educational settings. I believe the effort is worthwhile from a
teacher's point of view. 

Despite my day-to-day involvement with essay examining, it is as a 
teacher that I appreciate the advantages in adopting the outlook fostered 
by use of the Rasch model. To think in terms of a child moving up an 
ability scale as his skill increases better fits with the modern notions of a 
teacher's task. There is much less stress these days on sorting out the 
sheep from the goats. No one really wants to know who usually comes 
top, nor do teachers want to stigmatize children by placing them on a 
lowly rank in the class. Most teachers would be delighted to be removed
from the tyranny of the common examination paper and would appreciate
far more the model which suggests that children should be faced
with test materials that they find challenging but not impossible.

I share Choppin's uneasiness about rejecting items to meet the assumption
of unidimensionality. It would be difficult to accept an attitudinal
variable defined by what was left of a set of items after rejection of those
not fitting the model. The more appropriate action would seem to be to
question the decision rule which allowed these items into the scale. Every
effort should be made to revise the decision rules so that unidimensional
scales result. At present such revision is better handled by multivariate
methods not yet developed within the framework of the Rasch analysis.

Conditional Inference in a
Generic Rasch Model 

Graham A. Douglas 

The aim of this paper is to give an overview of the current state of the
conditional inference argument as it pertains to a class of statistical
models for measuring latent traits which we have come to term 'Rasch
models'. There are a number of reasons why we have chosen this more
technical topic for this conference. 

In the first place, much of what we in Australia do know about latent 
trait models, and Rasch models in particular, comes from the American
and, to a lesser extent, the English literature. Many of us were nourished
on a healthy diet of Wright and others (1968; 1969; 1977) and, while it
would be unfair to claim that they tend to ignore the conditional probability
arguments expounded by Rasch himself (1960), it is nonetheless
true that these arguments and their ramifications are little known or 
understood in this country. An important factor in this relative 
ignorance is that much of the subsequent elaboration of Rasch's ideas is 
to be found in the European literature, and then most of that in the 
German language. 

It is somewhat ironic that, on the one hand, psychometricians working 
in the field have so readily accepted the unconditional argument and its 
related computing algorithms, while at the same time embracing the concept
of 'specific objectivity' or, as Wright (1968) would call it, 'sample-free
parameter estimation', because without some form of conditional inference
argument it is difficult to demonstrate, at least algebraically, that
Rasch models do possess this property of specific objectivity. Con- 
ditional arguments alone produce probability expressions which depend 
on only one set of parameters at a time. 

A second reason relates to an increasing interest in and preoccupation 
with tests-of-fit of data to Rasch models, and in particular to the power 
of these tests. A debate between Wright (1977) and Whitely (1974; 1977) 
is an example of this concern. Recent articles by Gustafsson (1977; 1979;
1980) have demonstrated that, whereas an unconditional algorithm may
lead very quickly to almost correct parameter estimates in the binary item
analysis model, the asymptotic properties of the approximate unconditional
tests-of-fit which usually follow such estimation are far from
known. 

Another reason for looking into this topic pertains to the proliferation 
of a wide variety of statistical models which we might wish to call Rasch 
models. The time appears opportune to attempt an initial generalization 
and synthesis of the principles which underlie the structure of such 
models. Once again this attempt will borrow extensively from the Ger- 
man literature, and especially from the work of Scheiblechner (1971; 
1977), even though we hope to go beyond his developments. I will pre- 
sent the logic and the algebra associated with a general model for 
measurement, parallel its derivation with that of the binary item analysis 
model as an illustration, and then highlight significant details of two 
other models which may be derived from the generic form. Whereas all 
models derivable from our generalization are legitimately models for 
measurement, we may find the logic and even the language far removed 
from the familiarity of the binary item analysis model which we usually 
refer to as the Rasch model of educational and psychological measurement.
The intention then is to be sufficiently general to encompass a collection
of models with common characteristics suitable for the diversity
of measurement problems which arise in the behavioural sciences. 

A final reason for choosing conditional inference is to extend the 
arguments on conditional versus unconditional algorithms by suggesting, 
through example, that not all the numerical problems which have in the 
past been associated with the conditional approach have been solved, 
and that therefore there is ample scope for major developments in 
numerical approximations which will still allow us to stay within the 
rubric of the conditional framework. 

The body of the paper is structured into four main sections: a defini- 
tion of what constitutes a generic Rasch model within the class of latent 
trait models; an algebraic generalization of this definition and examples
of particular cases; an identification of some major problem areas in the 
implementation of the conditional arguments in practice; and some sug- 
gested remedies and directions for the future. 

DEFINITION OF A RASCH MODEL
By now the concept of specific objectivity has had sufficient exposure in
the literature that we come automatically to equate it with the models of
Rasch (1960). It will be valuable for the following, however, to re-state
the principle since other ways in which to view the defining characteristics
of Rasch models are simply variations of or equivalences to this
principle.

By specifically objective comparisons among entities, we mean com- 
parisons at a ratio level among a number of entities of a set, such com- 
parisons being uninfluenced in any way by any other entities which may 
belong either to the same class as those being compared or to completely 
different classes. For example, the comparison of the difficulty levels of 
two items in an achievement test is specifically objective if the comparison
does not depend on which particular population of subjects is
used to arrive at the estimates of item difficulty nor on the difficulties of
any other items which might happen to be estimated at the same time. As
Wright (1968) re-phrased this principle in an oft-quoted analogy:

it seemed that ... my ability depended not only on which items I took
but on who I was and the company I kept ... we hoped for easy tests so
as not to make us look dumb. (p. 85)
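
The item comparison described above can be illustrated numerically. The following is a minimal sketch, assuming the binary item analysis model; the function names and the parameter values are hypothetical and chosen only for illustration. It computes the probability that a subject who answers exactly one of two items correctly succeeded on the first, and shows that the result involves the two item difficulties only, whatever the ability of the subject.

```python
import math

def p_correct(beta, delta):
    """Rasch probability of a correct response (ability beta, difficulty delta)."""
    return math.exp(beta - delta) / (1.0 + math.exp(beta - delta))

def p_first_given_one_correct(beta, delta1, delta2):
    """P(item 1 right, item 2 wrong | exactly one of the two items is right)."""
    p1, p2 = p_correct(beta, delta1), p_correct(beta, delta2)
    only_first = p1 * (1.0 - p2)
    only_second = (1.0 - p1) * p2
    return only_first / (only_first + only_second)

# Illustrative (made-up) item difficulties and three very different abilities.
delta1, delta2 = -0.5, 1.0
comparisons = [p_first_given_one_correct(b, delta1, delta2) for b in (-3.0, 0.0, 2.5)]

# Closed form in which the ability parameter has cancelled.
expected = math.exp(-delta1) / (math.exp(-delta1) + math.exp(-delta2))
print(comparisons, expected)
```

The three conditional probabilities coincide with the closed form, which is the sense in which the comparison of the two items does not depend on who happens to be tested.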

Rasch's emphasis on specific objectivity in his writings (1960; 1968;
1977) stemmed from his belief that the principle was not only prevalent
but paramount in the rapid development during the last few centuries of
laws in physical sciences, even to the extent that one usually takes for
granted that one's comparisons of physical entities are of this nature.
Such is Rasch's conviction that the principle is all pervading in science
that his latest writings (1977) have centred on the common theoretical
structure underpinning specific objectivity and its variants in all sciences.
The current direction of research in this area is towards a group-theoretic
analysis of the concept and some unpublished preliminary work has been
completed by Borchsenius (1974; 1977).

Rasch argues that, from the practical point of view, specific objectivity 
allows one to concentrate attention on analysis, estimation, and fit of one
set of parameters at a time in frameworks which usually include potentially
many other sets of parameters.

An alternative expression used frequently by Rasch is that of
'separability of parameters'. Separability and specific objectivity may be
shown to be synonymous but the former term hints more closely at the
logical and algebraic properties inherent in models possessing specific
objectivity. By separability of parameters, we mean that, apart from simple
and trivial transformations, the pertinent parameters in probability
models for measuring must have such a structure that they are capable of
separation into disjoint sets. Since, for example, the discrimination
parameter, $\alpha_i$, in Lord's two-parameter logistic model (Lord and Novick,
1968) always acts upon item difficulty $\delta_i$ and subject ability $\beta_v$ in a
multiplicative manner, there is no way to separate difficulty and
discrimination parameters in this particular model. This can clearly be
seen if we write Lord's model in the following form:

$$P\{X_{vi} = x_{vi}\} = \frac{e^{\alpha_i(\beta_v - \delta_i)x_{vi}}}{1 + e^{\alpha_i(\beta_v - \delta_i)}} \qquad (1)$$

where (i) $\alpha_i$ is the item discrimination parameter,

(ii) $\delta_i$ is the item difficulty parameter,

(iii) $\beta_v$ is the subject ability parameter

and (iv) $x_{vi}$ is an indicator variable taking on the value 1 whenever
the item is answered correctly, and value zero, otherwise.

In the following we will refer to (1) as the exponential form of the model.

It is worth digressing at this stage to emphasize some points concerning
the chronological emergence of various models and to set the record
straight on comments like 'the Rasch model is just the simple one-parameter
case of Birnbaum's two-parameter logistic model'. It would
appear that Birnbaum (1968) adopted the logistic model as a
mathematical convenience because of some intractable estimation problems
associated with the two-parameter normal model; in fact many
expositions of the logistic model (to the base e) contain a multiplicative
scaling constant (usually set at approximately 1.7) to bring the logistic 
ogive more into line with the normal ogive. Certainly the one-parameter 
logistic was viewed by both Birnbaum and Lord as the special case of a 
two-parameter model in which all discriminations were set equal to one 
another. 
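
The role of the scaling constant can be checked directly. The sketch below is an illustration under stated assumptions: it evaluates the normal ogive through the standard error function, compares it with the logistic ogive with and without a multiplier of 1.7, and reports the largest gap on a grid of points. The grid and the function names are choices made here for illustration.

```python
import math

def normal_ogive(x):
    """Cumulative standard normal distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(x / math.sqrt(2.0)))

def logistic(x, scale=1.0):
    """Logistic ogive with an optional multiplicative scaling constant."""
    return 1.0 / (1.0 + math.exp(-scale * x))

# Evaluate both curves on a grid from -6 to 6 in steps of 0.01.
grid = [i / 100.0 for i in range(-600, 601)]
gap_unscaled = max(abs(normal_ogive(x) - logistic(x)) for x in grid)
gap_scaled = max(abs(normal_ogive(x) - logistic(x, 1.7)) for x in grid)
print(gap_unscaled, gap_scaled)
```

With the constant in place the two ogives differ by less than 0.01 everywhere on the grid, whereas the unscaled logistic departs from the normal ogive by more than 0.1. In a Rasch model no such constant is needed, since no appeal to the normal ogive is made.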

As far as Rasch was concerned, the logistic form of the model arose as 
a mathematically necessary consequence of his insistence on the principle 
of specific objectivity in comparisons. The base e is purely a convenience 
(any base will suffice for a Rasch model), there is no allusion to normal 
ogives and, although Rasch recognized that lack-of-fit of data to his 
models was a consequence of more than one parameter operating for 
each item, it is unlikely that he thought of the second parameter in terms
of discrimination, and certainly that word does not appear in his 1960
book. Rasch's model was never the consequence of simplifications to a
higher-order model but the necessary result of fundamental measurement
principles, principles of such generalizability that they could be applied
to measurement situations well beyond those rather narrow ones
conceived of by many psychometricians working on the other side of the
Atlantic; hence the subject matter of this paper, the 'generic Rasch
model'.

It is but a short step from the relative looseness of separability to the 
greater mathematical precision of additivity of parameters, and it is here 
that Rasch becomes quite explicit about how the parameters in his
models must be amalgamated. The additivity pertains to models written
in the exponential form; in (1) above, the item difficulty and subject
ability parameters appear additively and hence, with all $\alpha_i$ set equal to
one another, we have a model in which all parameters appear in the
additive form, and therefore a model with specific objectivity.

Another term which we use interchangeably with the preceding three is 
'sufficient statistic'. Rasch relies heavily on the work of Fisher relating to
sufficient estimators, conditional probabilities, and likelihood functions
to the extent that much of what is currently known in the mathematical
statistics literature about conditional likelihoods in general arose from
Rasch's interest in them with respect to his models for measuring
psychological processes. The fact that the conditional probabilities in
Rasch's models have known, well-behaved, and potentially useful properties
means that the central theorem describing the asymptotic
behaviour of unconditional maximum likelihood (u.m.l.) estimates may 
be extended to conditional maximum likelihood (c.m.l.) estimates. This 
discovery paved the way for Andersen (1973a) to develop powerful tests 
of the fit of Rasch models to their respective data sets. 

Although it has been demonstrated many times before, the fact that
Lord's two-parameter logistic model for item analysis does not exhibit a
genuine sufficient statistic for the ability parameter, $\beta_v$, bears repetition.
The expression

$$T_v = \sum_i \alpha_i x_{vi}$$

does not constitute a statistic, let alone a sufficient one, and there is little
gain in claiming that $T_v$ is sufficient if we know the values of the $\alpha$'s since
nearly always we are forced to try to estimate the $\alpha$'s along with the $\delta$'s.
In fact, were the $\alpha$'s to be known, Andersen (1977) has shown that there
can be no data reduction in using $T_v$ unless all the $\alpha$'s are equal, and so we
are forced back onto response patterns, a situation we are trying to
avoid since this means that, in order to know something about a subject's
ability, we would have to retain the complete set of original data on that
subject and not just the summary which is embodied in the sufficient
statistic called the raw score. In a Rasch model, no further information
about a subject's ability may be gained beyond a knowledge of the raw score.
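
The contrast between the two models can be made concrete by listing response patterns. The sketch below is an illustration only, with made-up item parameters and hypothetical function names. It computes the probability of one response pattern given the raw score, first with equal discriminations (the Rasch case) and then with unequal discriminations (the two-parameter case), at three different abilities.

```python
import math
from itertools import product

def pattern_probability(pattern, beta, deltas, alphas):
    """Probability of one response pattern under the two-parameter logistic model."""
    prob = 1.0
    for x, delta, alpha in zip(pattern, deltas, alphas):
        p = 1.0 / (1.0 + math.exp(-alpha * (beta - delta)))
        prob *= p if x == 1 else 1.0 - p
    return prob

def conditional_on_raw_score(pattern, beta, deltas, alphas):
    """P(pattern | raw score), obtained by listing all patterns with that score."""
    r = sum(pattern)
    same_score = [q for q in product((0, 1), repeat=len(deltas)) if sum(q) == r]
    total = sum(pattern_probability(q, beta, deltas, alphas) for q in same_score)
    return pattern_probability(pattern, beta, deltas, alphas) / total

deltas = [-1.0, 0.0, 1.5]      # illustrative item difficulties
pattern = (1, 0, 1)            # one response pattern with raw score 2
equal = [1.0, 1.0, 1.0]        # Rasch case: all discriminations equal
unequal = [0.5, 1.0, 2.0]      # two-parameter case

abilities = (-2.0, 0.0, 2.0)
rasch = [conditional_on_raw_score(pattern, b, deltas, equal) for b in abilities]
two_pl = [conditional_on_raw_score(pattern, b, deltas, unequal) for b in abilities]
print(rasch)    # the same value at every ability
print(two_pl)   # a different value at each ability
```

With equal discriminations the conditional probability is free of the ability parameter, so the raw score carries all the information about ability; with unequal discriminations the pattern itself remains informative.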

It is important in orienting oneself to the concepts involved in Rasch 
models to point out that in these models the conditional inference argu- 
ment is used not so much to identify groups of sufficient statistics for the 
purpose of estimating the parameter sets with which they are associated 
(as is usually the case in statistical models), but more to eliminate or 
condition-out those sets of unwanted or incidental parameters, thus clearing
the way for estimation of another set of parameters. Scheiblechner
(1977) makes the point abundantly clear when he employs the terms
'general' and 'idiosyncratic' to describe features of the psychological
processes which we attempt to model. Conditional inference arguments are
designed to remove the idiosyncratic features (i.e. individual differences)
which have tended to handicap the development of our understanding of 
the more general psychological processes. 

Furthermore, by extending the conditional inference argument fully,
we are always led to probability statements about the data which are 
completely free of all parameters and hence we set the stage for exact 
probability (non-parametric) tests of fit. With such tests there is never 
any doubt about distributional assumptions or the dubious application 
of asymptotic theory. For Rasch this was the most powerful consequence 
of adopting the conditional inference stance.

Separation of parameters (and hence sufficient statistics) arises because
of the additive (non-interactive) relationships among the parameters.
Without this feature the separation does not occur and the conditional
argument breaks down. It is not coincidental that, for models of the type
considered by Rasch, the necessity for a conditional probability argument
could be developed from the work of Neyman and Scott (1948) on
incidental and structural parameters leading to inconsistent estimates. In
the framework of binary item analysis where the emphasis is on item 
estimation, the Neyman and Scott dilemma means that each potentially 
new subject in the calibrating population carries, in his or her response 
vector, a certain amount of information about each of the items, plus a 
new parameter associated with his or her own ability. Thus the number 
of incidental ability parameters could increase without bound. 

We will try to incorporate the various concepts described in previous 
paragraphs into a general model which exhibits all of these properties in 
addition to those properties that we usually demand from any latent trait 
model. We will argue that the generic model represents a probabilistic
definition of 'Rasch model' in the sense that all models which appear in
the literature under this rubric are derivable from our generic form, and
that all models which do not fit into the mould cannot genuinely be called
Rasch models.

Our aim in attempting this generalization is not to constrain investigation
of probabilistic models in the social sciences just to those which do
exactly fit our framework. There may be models which do not conform
in various ways to the general expression but which nevertheless display
interesting and valuable properties: for example, the dynamic test model
of Kempf (1976). Still, within a well-defined set of assumptions based on
the concepts under discussion here, a wide variety of models follow to 
which the label Rasch model may be attached. 

THE GENERIC FORM OF A RASCH MODEL
The crucial principle of any Rasch model is the way in which the
parameter structure factors into additive components. Hence any general
parameter, $\theta$, must be factorable into other unidimensional or
multidimensional parameters which add together in the exponent. This
restriction confirms, for example, that in the binary item analysis model
there is no place for a discrimination parameter $\alpha_i$, since it
necessarily multiplies $\beta_v$ and $\delta_i$, nor is there place for Lumsden's (1977)
person sensitivity parameter, for exactly the same reason.

Although we are accustomed to thinking in terms of two interacting 
facets in a Rasch model (the test items and the responding subjects in the 
binary item analysis model), we must provide in this general framework 
for any number of facets† interacting simultaneously. Each facet consists
of a number of elements and by the term 'interaction' we will mean the
simultaneous confrontation of one element from each of the facets. For
example, one marker assessing the essay writing ability of one subject on
one essay question represents a single individual interaction in a three-facet
framework. The totality of observations may be represented in a
data 'cube' of as many dimensions as there are facets. Marginal summaries
of the data cube, either of one or many dimensions, may be
effected by summing the individual responses across various combina- 
tions of the facets. 
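
The data cube and its marginal summaries can be sketched in a few lines. The example below assumes a hypothetical three-facet design (markers, subjects, essay questions) with made-up integer scores; the function `marginal` and all of the names are illustrative and are not taken from any published program.

```python
from itertools import product

# Hypothetical three-facet design: 2 markers x 3 subjects x 2 essay questions.
n_markers, n_subjects, n_questions = 2, 3, 2

# cube[(marker, subject, question)] holds one response per individual interaction.
# The scores are invented purely so that there is something to sum.
cube = {
    cell: (cell[0] + 2 * cell[1] + cell[2]) % 4
    for cell in product(range(n_markers), range(n_subjects), range(n_questions))
}

def marginal(cube, keep):
    """Sum the responses over every facet whose position is not listed in `keep`."""
    totals = {}
    for cell, score in cube.items():
        key = tuple(cell[axis] for axis in keep)
        totals[key] = totals.get(key, 0) + score
    return totals

subject_totals = marginal(cube, keep=(1,))        # a one-dimensional marginal
marker_by_subject = marginal(cube, keep=(0, 1))   # a two-dimensional marginal
print(subject_totals)
print(marker_by_subject)
```

Each marginal is obtained by collapsing the cube over the facets that are not of current interest, which is the operation on which the sufficient statistics of the later sections depend.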

Not all responses are binary. In quantifying attitude questionnaire 
items, for example, we may allow for multiple category responses scored 
with the integers 0, 1, ..., m, instead of the usual 0, 1 of binary item
analysis. Most often the number of response categories is fixed in ad- 
vance, but there are measurement models in which the number of 
categories is open-ended. An example of this situation will be discussed 
in the next section. 

While the basic multinomial random variable in our models always 
represents the response when an individual interaction occurs, we will 
find it more convenient when expressing the model in its statistical form 
to use an indicator variable which takes on the value 1 whenever response
category h is used and takes on the value of zero, otherwise. This means
that the basic unit of observation is a set of (m + 1) responses, all of
which are zero except for the hth element, which takes the value 1.
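
As a small illustration of this coding, the sketch below builds the set of (m + 1) indicator values for a response in category h. The function name is hypothetical.

```python
def indicator_vector(h, m):
    """Return the (m + 1) indicator values for a response in category h."""
    if not 0 <= h <= m:
        raise ValueError("category h must lie in 0..m")
    return [1 if category == h else 0 for category in range(m + 1)]

# A response in category 2 of a five-category item (m = 4).
print(indicator_vector(2, 4))  # [0, 0, 1, 0, 0]
```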

With these preliminary comments we are now in a position to give a
formal statement of the generic Rasch model for measurement. 

(i) A total of t facets of an observational framework are in
simultaneous interaction such that each individual response
may take on a value h in the range h = 0, 1, ..., m.

(ii) $i_1, i_2, \ldots, i_t$ are index sets for the respective facets and there are $N_s$
elements in facet s.

(iii) The data may be arranged in a t-dimensional hypercube such that
it is possible to calculate 1-dimensional, 2-dimensional, etc., marginal
summaries of the data.

(iv) $\theta_{i_1 i_2 \ldots i_t h}$ is a general function of parameters which is factorable
into w additive component parameters as follows,

$$\theta_{i_1 i_2 \ldots i_t h} = \mu_1 + \mu_2 + \cdots + \mu_w$$

where the subscript j in $\mu_j$ stands for a sub-set of the indices $i_1, i_2, \ldots, i_t$.
$\mu_j$ represents a continuous latent trait (unknown property) of facet
combination j and is manifested in the observed data.

(v) $x_{i_1 i_2 \ldots i_t h}$ is an indicator variable, taking on the value 1 whenever
response category h is utilized by facet combination $i_1 i_2 \ldots i_t$ and taking
on the value zero otherwise.

(vi) $X_{i_1 i_2 \ldots i_t h}$ is a random variable whose observed value is $x_{i_1 i_2 \ldots i_t h}$.

(vii) All interactions are stochastically independent, conditional on
the parameters in $\theta$.

† Following Guttman and Cronbach, Rasch's term 'factor' is avoided because of its
obvious alternative connotations in psychology.

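With these notations we may write the probability of the random
variable $X_{i_1 i_2 \ldots i_t h}$ taking on the value $x_{i_1 i_2 \ldots i_t h}$ in terms of a general
exponential function of the parameter $\theta$ and the indicator variable
$x_{i_1 i_2 \ldots i_t h}$ as

$$P\{X_{i_1 i_2 \ldots i_t h} = x_{i_1 i_2 \ldots i_t h}\} = \frac{\exp\left(\sum_{h=0}^{m} \theta_{i_1 i_2 \ldots i_t h}\, x_{i_1 i_2 \ldots i_t h}\right)}{\sum_{h=0}^{m} \exp\left(\theta_{i_1 i_2 \ldots i_t h}\right)} \qquad (2)$$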
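
A minimal computational sketch of the generic form may help here. It assumes that expression (2) is the exponential of the sum of each category parameter times its indicator, divided by the sum of the exponentials of the category parameters; that reading, the function names, and the parameter values are assumptions made for illustration.

```python
import math

def category_probabilities(thetas):
    """P(response in category h) for h = 0..m, given the (m + 1) values theta_h."""
    weights = [math.exp(theta) for theta in thetas]
    total = sum(weights)
    return [w / total for w in weights]

def response_probability(indicator, thetas):
    """exp(sum_h theta_h * x_h) divided by sum_h exp(theta_h), as assumed for (2)."""
    exponent = sum(theta * x for theta, x in zip(thetas, indicator))
    return math.exp(exponent) / sum(math.exp(theta) for theta in thetas)

# Binary special case with assumed values: theta_0 = 0 and theta_1 = beta - delta.
beta, delta = 0.8, -0.2
thetas = [0.0, beta - delta]
print(category_probabilities(thetas))
print(response_probability([0, 1], thetas))
```

Because only one indicator is non-zero for any interaction, the two functions agree category by category, and the binary case reduces to the usual logistic expression in the difference between ability and difficulty.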

Example 1: Binary Item Analysis Model

The familiar binary item analysis model arises by making the following
changes to symbols in (2).

(i) $i_1 = v$ and $i_2 = i$,
with $v = 1, \ldots, N$ subjects and $i = 1, \ldots, k$ items, and $h = 0, 1$.

(ii) Set $\theta_{vi} = \mu_1 + \mu_2$
with $\mu_1 = \beta_v$
and $\mu_2 = -\delta_i$, where

$\beta_v$ is the (latent) ability parameter of subject v,

$\delta_i$ is the (latent) item difficulty parameter of item i.

(iii) Set $x_{vi} = 1$ whenever subject v gets item i correct, and $= 0$ otherwise.
Then the generic model takes the form,

$$P\{X_{vi} = x_{vi}\} = \frac{e^{(\beta_v - \delta_i)x_{vi}}}{1 + e^{\beta_v - \delta_i}}$$

We must now demonstrate that the property of specific objectivity 
follows from this model. In order to do that we must derive the (uncon- 
ditional) joint probability of all the data, the joint probability of 
marginal statistics suitably identified, and the conditional probability of 
one marginal set, conditional on all others. In order to know which 
marginals to consider in the conditioning, we must select a parameter set 
which is to be the subject of current investigation; for the binary item 
analysis model, for example, we usually focus attention first on 
calibrating a fixed set of items, in which case the subject parameter set, 
$(\beta)$, is incidental. On the other hand, when the emphasis is on measuring
the abilities of a fixed number of subjects employing items from an item
pool, the item parameter set, $(\delta)$, is considered to be incidental. A
measurement perspective is necessary before adopting the conditional 
probability argument. 

As an appendage to the mainstream argument we will demonstrate, in 
the manner of Rasch (1960), that completely parameter-free tests of fit 
follow by extending the conditional argument to its limit. 

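The unconditional probability (likelihood) of the total data set is given by
the continued product of the probabilities of the $N_1 N_2 \cdots N_t$ individual
responses and may be written as

$$L_u = P\{X_{11 \ldots 1} = x_{11 \ldots 1}, \ldots, X_{N_1 N_2 \ldots N_t} = x_{N_1 N_2 \ldots N_t}\} = \frac{\exp\left(\sum_{i_1} \sum_{i_2} \cdots \sum_{i_t} \sum_h (\mu_1 + \mu_2 + \cdots + \mu_w)\, x_{i_1 i_2 \ldots i_t h}\right)}{\prod_{i_1} \prod_{i_2} \cdots \prod_{i_t} \sum_h \exp(\mu_1 + \mu_2 + \cdots + \mu_w)} \qquad (3)$$

We have already replaced $\theta$ by its appropriate factorization since the
structure of $\theta$ is determined by the model builder and not by any of the
algebra of the derivations.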

It will always be possible to re-write the numerator of this expression
in such a way that the set of sufficient statistics for the set of parameters,
$(\mu_j)$, is clearly indicated. We will write $(x_{\cdot j})$ for that data summary
(marginal) which arises from summing $x_{i_1 i_2 \ldots i_t h}$ over h and over all
index sets which are not included in j. Hence we have

$$L_u = \frac{\exp\left(\sum \mu_1 x_{\cdot 1} + \sum \mu_2 x_{\cdot 2} + \cdots + \sum \mu_w x_{\cdot w}\right)}{\prod_{i_1} \prod_{i_2} \cdots \prod_{i_t} \sum_h \exp\left(\theta_{i_1 i_2 \ldots i_t h}\right)} \qquad (4)$$

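In the Binary Item Analysis Model,

$\mu_1$ contains the single-dimension parameter $\beta_v$,
$\mu_2$ contains the single-dimension parameter $\delta_i$,

and therefore

$$L_u = \frac{\exp\left(\sum_{v=1}^{N} \beta_v r_v - \sum_{i=1}^{k} \delta_i s_i\right)}{\prod_{v=1}^{N} \prod_{i=1}^{k} \left(1 + e^{\beta_v - \delta_i}\right)}$$

where we usually write $r_v$ for $x_{v \cdot}$ and $s_i$ for $x_{\cdot i}$.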

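Since (4) is written in terms of all relevant marginals, $(x_{\cdot 1}), (x_{\cdot 2}), \ldots, (x_{\cdot w})$,
and thus does not contain the original data explicitly, the joint probability
of all marginal sets is simply C times (4), where C is the number of
possible complete data sets which could have produced exactly the
observed marginals, $(x_{\cdot 1}), (x_{\cdot 2}), \ldots, (x_{\cdot w})$. (C is a combinatorial number for
which no algorithm, other than listing, has as yet been developed.) The
probability is written as

$$L_j = P\{(x_{\cdot 1}), (x_{\cdot 2}), \ldots, (x_{\cdot w})\} = \frac{C \exp\left(\sum \mu_1 x_{\cdot 1} + \sum \mu_2 x_{\cdot 2} + \cdots + \sum \mu_w x_{\cdot w}\right)}{\prod_{i_1} \prod_{i_2} \cdots \prod_{i_t} \sum_h \exp\left(\theta_{i_1 i_2 \ldots i_t h}\right)} \qquad (5)$$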

As an aside to the main argument on specific objectivity, but of para- 
mount relevance to testing of fit, we may demonstrate another property 
of the conditional inference procedure. Using (4) and (5) we may write 
the conditional probability of the observed data set given all the 
marginals, as 

$$L^{*} = P\{X_{11 \ldots 1} = x_{11 \ldots 1}, \ldots, X_{N_1 N_2 \ldots N_t} = x_{N_1 N_2 \ldots N_t} \mid (x_{\cdot 1}), (x_{\cdot 2}), \ldots, (x_{\cdot w})\} = \frac{1}{C} \qquad (6)$$

This probability (likelihood) is free of all parameters in the model and its 
value as a tool in testing fit will be commented upon in a later section. 
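
For a very small binary data matrix the number C can be found by listing, and the parameter-free property can be verified directly. The sketch below is an illustration under the binary item analysis model; the observed matrix, the parameter values, and the function names are all hypothetical, and complete enumeration is feasible only because the matrix has three subjects and three items.

```python
import math
from itertools import product

def all_binary_matrices(n_subjects, n_items):
    """Yield every possible subjects-by-items matrix of 0/1 responses."""
    for cells in product((0, 1), repeat=n_subjects * n_items):
        yield [list(cells[v * n_items:(v + 1) * n_items]) for v in range(n_subjects)]

def margins(matrix):
    """Raw scores (row sums) and item counts (column sums) of a data matrix."""
    raw_scores = tuple(sum(row) for row in matrix)
    item_counts = tuple(sum(col) for col in zip(*matrix))
    return raw_scores, item_counts

def matrix_probability(matrix, betas, deltas):
    """Unconditional probability of a data matrix under the binary Rasch model."""
    prob = 1.0
    for row, beta in zip(matrix, betas):
        for x, delta in zip(row, deltas):
            p = 1.0 / (1.0 + math.exp(-(beta - delta)))
            prob *= p if x == 1 else 1.0 - p
    return prob

def conditional_given_margins(observed, betas, deltas):
    """Return C and P(observed matrix | all marginals), both found by listing."""
    target = margins(observed)
    n_subjects, n_items = len(observed), len(observed[0])
    same = [m for m in all_binary_matrices(n_subjects, n_items) if margins(m) == target]
    total = sum(matrix_probability(m, betas, deltas) for m in same)
    return len(same), matrix_probability(observed, betas, deltas) / total

observed = [[1, 0, 0], [1, 1, 0], [0, 1, 1]]
C, prob_a = conditional_given_margins(observed, [-1.0, 0.0, 1.0], [-0.5, 0.0, 0.5])
_, prob_b = conditional_given_margins(observed, [2.0, -2.0, 0.3], [1.5, -1.0, 0.0])
print(C, prob_a, prob_b)
```

Two quite different sets of parameter values give the same conditional probability, namely the reciprocal of C, which is what makes an exact test of fit possible in principle. The listing itself grows explosively with the size of the matrix, which is the numerical difficulty referred to in the text.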
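Returning to the main development, we now focus attention on one
component, say $(\mu_s)$, and its associated marginal, $(x_{\cdot s})$, in order to find
the marginal probability of all marginal sets other than $(x_{\cdot s})$. We do this
by summing the unconditional probability over all possible values of $(x_{\cdot s})$
which are compatible with the remaining marginals,
$(x_{\cdot 1}), (x_{\cdot 2}), \ldots, (x_{\cdot s-1}), (x_{\cdot s+1}), \ldots, (x_{\cdot w})$.

Hence

$$L_m = P\{(x_{\cdot 1}), (x_{\cdot 2}), \ldots, (x_{\cdot s-1}), (x_{\cdot s+1}), \ldots, (x_{\cdot w})\} = \sum L_j,$$

the sum being taken over all $(x_{\cdot s})$ such that all other marginals are fixed, so that

$$L_m = \frac{\gamma(\mu_s)\, \exp\left(\sum \mu_1 x_{\cdot 1} + \cdots + \sum \mu_{s-1} x_{\cdot s-1} + \sum \mu_{s+1} x_{\cdot s+1} + \cdots + \sum \mu_w x_{\cdot w}\right)}{\prod_{i_1} \prod_{i_2} \cdots \prod_{i_t} \sum_h \exp\left(\theta_{i_1 i_2 \ldots i_t h}\right)} \qquad (7)$$

where a symmetric function in the parameter set $(\mu_s)$ is defined as

$$\gamma(\mu_s) = \sum C\, e^{\sum \mu_s x_{\cdot s}},$$

the sum again being taken over all $(x_{\cdot s})$ such that all
other marginals are fixed.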

In the Binary Item Analysis Model, with attention focused on the estimation
of the item parameter set, $(\delta)$, we have to determine a symmetric
function in the set, $(\delta)$, which is defined as the sum over all possible
item-count marginal sets, $(s^{*})$, which are compatible with the observed
raw-score set, $(r)$. In practice a number of different binary data matrices ($C^{*}$
of them) will result in the same set, $(s^{*})$, and this number $C^{*}$ must be
included in the definition, as shown,

$$\gamma(\delta) = \sum C^{*} e^{-\sum_i \delta_i s_i^{*}},$$

the sum being taken over all $(s^{*})$ such that $(r)$ is fixed.

Then

$$L_m = \frac{\gamma(\delta)\, e^{\sum_v \beta_v r_v}}{\prod_{v=1}^{N} \prod_{i=1}^{k} \left(1 + e^{\beta_v - \delta_i}\right)}$$



Finally we need the conditional probability of the marginal set of
interest, $(x_{\cdot s})$, given all other marginal sets which we are attempting to
eliminate. This conditional probability is obtained by dividing (5) by (7)
to obtain

$$L_c = P\{(x_{\cdot s}) \mid (x_{\cdot 1}), (x_{\cdot 2}), \ldots, (x_{\cdot s-1}), (x_{\cdot s+1}), \ldots, (x_{\cdot w})\} = \frac{C\, e^{\sum \mu_s x_{\cdot s}}}{\gamma(\mu_s)} \qquad (8)$$

which is dependent only on the parameter set, $(\mu_s)$, and not on any other
parameters in the model.

In the Binary Item Analysis Model, upon division of the joint probability
by the marginal probability, we have the conditional probability
of the set of marginal item-counts, given the set of raw scores, as

$$L_c = \frac{C\, e^{-\sum_i \delta_i s_i}}{\gamma(\delta)}$$

This conditional probability depends only on the item parameters and 
not on the subject ability parameters. 

To complete the algebraic story, we next set the vector of first
derivatives, with respect to the $\mu$'s, of the log-likelihood, equal to a zero
vector to obtain a set of conditional maximum likelihood (c.m.l.)
equations.

$$\frac{\partial \ln L_c}{\partial \mu_s} = x_{\cdot s} - \frac{\partial \ln \gamma(\mu_s)}{\partial \mu_s} = 0 \qquad (9)$$

Upon solving these equations, we insert the c.m.l. estimates into a matrix
which is the negative inverse of the matrix of second derivatives of the
log-likelihood, thus arriving at

$$\left[\frac{\partial^2 \ln \gamma(\mu_s)}{\partial \mu_s\, \partial \mu_s'}\right]^{-1} \qquad (10)$$

which is an $N_s \times N_s$ matrix of estimated error covariances and from which
the asymptotic standard errors of the $\hat{\mu}_s$'s are obtained by extracting the
square roots of the diagonal elements.

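In the Binary Item Analysis Model, the maximum likelihood equations
and the error covariance matrix are given by

$$\frac{\partial \ln L_c}{\partial \delta_i} = -s_i - \frac{\partial \ln \gamma(\delta)}{\partial \delta_i} = 0, \qquad i = 1, \ldots, k$$

and

$$\left[\frac{\partial^2 \ln \gamma(\delta)}{\partial \delta\, \partial \delta'}\right]^{-1}$$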



A perhaps more familiar but algebraically identical version of the c.m.l.
equations presented above, as might be found in Andersen (1972) or
Wright and Douglas (1977), takes the form,

$$s_i - \sum_r n_r \frac{e^{-\delta_i}\, \gamma_{r-1,i}}{\gamma_r} = 0, \qquad i = 1, \ldots, k$$

where (i) $n_r$ is the number of subjects with a raw score of r,

(ii) $\gamma_r$ is the rth order symmetric function in the set $(\delta)$, defined as

$$\gamma_r = \sum e^{-\sum_i \delta_i x_i},$$

the sum being taken over all response vectors $(x)$
such that r is fixed,

(iii) $\gamma_{r-1,i}$ is the $(r-1)$th symmetric function in which all terms
involving $\delta_i$ have been removed.
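
The symmetric functions in these equations need not be obtained by listing response vectors. The sketch below is an illustration with made-up item difficulties; the recursive build-up used here is a standard way of computing elementary symmetric functions and is offered as an assumption about how one might proceed, not as the algorithm of any of the cited authors. It also checks the quantity that appears inside the c.m.l. equations, namely the probability that a subject with raw score r answered item i correctly.

```python
import math
from itertools import combinations

def symmetric_functions(epsilons):
    """Elementary symmetric functions gamma_0..gamma_k of epsilon_i = exp(-delta_i)."""
    gammas = [1.0] + [0.0] * len(epsilons)
    for count, eps in enumerate(epsilons, start=1):
        # Bring in one item at a time; work downwards so each item enters once.
        for r in range(count, 0, -1):
            gammas[r] += eps * gammas[r - 1]
    return gammas

def symmetric_by_listing(epsilons, r):
    """gamma_r by brute force: one term per response vector with raw score r."""
    return sum(math.prod(subset) for subset in combinations(epsilons, r))

deltas = [-1.2, -0.3, 0.4, 1.1]            # illustrative item difficulties
epsilons = [math.exp(-d) for d in deltas]
gammas = symmetric_functions(epsilons)

def gamma_without_item(i, r):
    """The r-th symmetric function with all terms involving item i removed."""
    others = epsilons[:i] + epsilons[i + 1:]
    return symmetric_functions(others)[r]

r = 2
# Probability that a subject with raw score r answered item i correctly.
p_correct_given_r = [epsilons[i] * gamma_without_item(i, r - 1) / gammas[r]
                     for i in range(len(deltas))]
print(gammas)
print(p_correct_given_r, sum(p_correct_given_r))
```

The conditional probabilities sum to r over the items, as they must, since a subject with raw score r has exactly r correct responses.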

The other two examples which follow are presented in the style of
Andersen (1972, 1973a). The difference between his approach and that of
Rasch which we have adhered to is that, for the binary item analysis
model, Andersen derives the conditional probability of the response
vector for a single subject (conditional on that subject's raw score) and
then arrives at the conditional likelihood of the total data matrix by
taking the product of the individual likelihoods over all N subjects. This
procedure produces a different likelihood, avoids the use of the numbers C
and requires the calculation of separate symmetric functions for each raw
score. Despite the different likelihoods, the c.m.l. equations are the same
as those of Rasch and thus we are led to the same parameter estimates.

In addition to the binary item analysis model and the two models yet to
be considered, some other models which fit into our framework and
which have received varied attention in the literature are:

(a) the speed of oral-reading model of Rasch (1960);

(b) sociometric choice models, Scheiblechner (1971);

(c) the multi-dimensional questionnaire model, Andersen (1972);

(d) the measurement-of-change models, Fischer (1976), which
introduce the facet of time;

(e) the linear logistic model, Fischer (1977);

(f) the multiplicative binomial model, Andrich (1978a), Douglas
(1978);

(g) the grader/subject/item model, Malone (1980).

Example 2: The Rating Model, Andrich (1978b)

This model is one of a hierarchy of models derived from the
multidimensional model of Andersen (1972). Starting from a general
parameter indexed by v, i and h, which may be written as

(with v, i and h taking their usual meaning), we further factor this into

Andersen identifies the $\phi$'s as the 'scoring' parameters and the
$\kappa$'s as the 'category' parameters. This model appears pertinent to
any rating situation in which a series of items (or, more generally,
questions) each permit a response on a scale which may be quantified from
0 to m. Andrich (1978a), in his paper, prefers to work with a simple
transformation of the $\kappa$'s,

$$\kappa_h = -\sum_{l=1}^{h} \tau_l \qquad (h = 1, \ldots, m)$$

He calls the $\tau$'s the 'threshold' parameters, after Thurstone.

In order for a genuine Rasch model to eventuate, we must set the scoring
parameters equal to consecutive integers, that is

$$\phi_h = h \qquad (h = 0, 1, \ldots, m).$$

Only by doing this will we have a genuine sufficient statistic,

for the structural (item) parameters, thus permitting their elimination
from the conditional likelihood. As Andersen (1977) and Andrich
(1978b) have frequently pointed out, such a 'restriction' is very much in
keeping with the notion of integer scoring advocated by Likert (1932)
and does appear appropriate for a wide class of data sets of the rating or
attitudinal type.

There still remains the question of the identifiability of the category
parameters, $\kappa_h$. Some simple algebra will show that, since the
marginal set ($x_{v \cdot h}$), the number of times that subject v used
category h, is sufficient for the parameter combination set
($h\beta_v + \kappa_h$), the category parameters are well defined and are
permitted in the model since they will be eliminated (along with the
subject parameters) when the conditional likelihood is obtained. With
these comments it should be clear that the probability of an individual
response may be written as

and that the conditional likelihood for estimating item parameters
follows as



Although we have as yet made no mention of numerical problems
associated with the estimation of the parameters in any of the models, it
should be noted that, in applications of this model to real data, Andrich
at least has used an estimation algorithm based on the unconditional
likelihood, and then has 'corrected' the estimates to bring them more into
line with what would arise had the conditional likelihood been correctly
used; more on these problems later.

Example 3: The Rasch (1960)/Andrich (1973) Essay Grading Model

A perennial problem in the grading of extended response answers (such
as essays), when more than one grader is involved in the marking, is the
question of the varying grader harshnesses which result in comparisons
among subject essay-writing abilities which are highly suspect. One
avenue out of this dilemma has been to train graders to such a level of
consistency that we are prepared to accept all graders as virtual
replications of one another. Problems of marker reliability are then
assumed to have been controlled. This training, however, is never very
satisfactory and, even more importantly, by trying to force all graders
into the same mould, we lose potentially important information about the
psychological processes involved in essay marking as reflected in the very
individual idiosyncrasies that we are trying to eliminate. It is
preferable to control grader differences but still retain information
about them than to throw away that information altogether.

Andrich devised a model in which grader harshness was explicitly
parameterized; because the model was a Rasch model and the specific
objectivity property held, it was possible to estimate subject
essay-writing ability independently of the particular grader involved in
the marking. In this example we see a straightforward application of the
strategy of identifying a sufficient statistic for the fundamental purpose
of parameter elimination.

The model adopted by Andrich had formal similarities to the one for 
errors in oral reading described by Rasch in his 1960 book. In that 
development and in the thesis work of Andrich, the Poisson distribution 
was the starting point. We will show now that by means of simple 
transformations this model arises naturally from our generic form. 

In the first place we assume that graders are permitted to detect an
unlimited number of errors in a subject's script. This direction to graders
to use an open-ended scoring scale is equivalent to letting m tend to
infinity in the list, h = 0, 1, . . ., m.

Furthermore, since the basic random variable in this model is the
number of errors detected, the subject ability parameter, $\beta_v$,
should enter the model with a negative sign if we are to make the same
interpretation of it as we have done in other models. With these minor
variations and with $\eta_g$ as the grader parameter, we may write the
probability of an individual response when grader g assesses the script of
subject v as

$$P(X_{vg} = x_{vg}) = \frac{\exp\left[\ln\left(\frac{1}{x_{vg}!}\right) + x_{vg}(\eta_g - \beta_v)\right]}{\sum_{h=0}^{\infty} \exp\left[\ln\left(\frac{1}{h!}\right) + h(\eta_g - \beta_v)\right]}$$

Let us consider further elaboration of the model from the perspective
of estimating the grader harshnesses, $\eta_g$. It is not difficult to see
that $x_{v\cdot}$ is sufficient for $\beta_v$ and that $x_{\cdot g}$ is
sufficient for $\eta_g$. The conditional likelihood is written as

$$P\left[(X_{vg} = x_{vg}) \mid (x_{v\cdot})\right] = \frac{\exp\left[\sum_v \sum_g \ln\left(\frac{1}{x_{vg}!}\right) + \sum_g \eta_g\, x_{\cdot g}\right]}{\prod_v \gamma_{x_{v\cdot}}(\eta)}$$

For those of us for whom statistics are quite often a mystery, the
relationship between this likelihood and the Poisson distribution appears
rather remote. However, if we make the following transformations,

$$A_g = e^{\eta_g}$$

and

$$B_v = e^{\beta_v},$$

and note, through a first-principles definition of the exponential
function, that the infinite sum in the denominator,

$$\sum_{h=0}^{\infty} \exp\left[\ln\left(\frac{1}{h!}\right) + h(\eta_g - \beta_v)\right],$$

converts to

$$\sum_{h=0}^{\infty} \frac{1}{h!}\left(\frac{A_g}{B_v}\right)^{h},$$

which is just a definition of

$$e^{A_g/B_v},$$

and furthermore that the obtuse expression

$$\exp\left[\ln\left(\frac{1}{x_{vg}!}\right)\right]$$

may be re-written more simply in terms of factorials as

$$\frac{1}{x_{vg}!},$$

we arrive finally at an equivalent form of the model,

$$P(X_{vg} = x_{vg}) = \frac{(A_g/B_v)^{x_{vg}}\; e^{-A_g/B_v}}{x_{vg}!}.$$

This distribution is directly Poisson with parameter $A_g/B_v$. Despite
the fact that it is Poisson rather than logistic we may still apply our
marginal and conditional arguments, by which we are led to the
conditional likelihood



$$\prod_v \prod_g \frac{1}{x_{vg}!}\; \prod_g A_g^{x_{\cdot g}} \left[\prod_v \gamma_{x_{v\cdot}}(A)\right]^{-1}$$

Expressed in this manner, even the symmetric functions may be shown to
have a more favourable form, in that we may write

$$\gamma_r(A) = \frac{1}{r!}\left(\sum_g A_g\right)^{r}.$$

Thus the most transparent form of the conditional likelihood is

$$\frac{\prod_v x_{v\cdot}!\; \prod_g A_g^{x_{\cdot g}}}{\left(\sum_g A_g\right)^{x_{\cdot\cdot}}\; \prod_v \prod_g x_{vg}!}$$

Andrich (1973) gives details of estimation and tests of fit and shows that
the conditional and the unconditional likelihoods give identical
parameter estimates. One should not, of course, take this as a sign that
there are many other models for which this is true.
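
The favourable form of the symmetric functions in the Poisson case follows from the multinomial theorem and can be checked numerically. The sketch below is an illustration only: it assumes the symmetric function of order r is the sum, over all count vectors with total r, of $\prod_g A_g^{x_g}/x_g!$, and the grader values are hypothetical.

```python
from itertools import product
from math import factorial, isclose, prod


def gamma_by_enumeration(A, r):
    """Sum of prod_g A_g**x_g / x_g! over all count vectors (x_g) with total r."""
    total = 0.0
    for x in product(range(r + 1), repeat=len(A)):
        if sum(x) == r:
            total += prod(a ** c / factorial(c) for a, c in zip(A, x))
    return total


def gamma_closed_form(A, r):
    """The closed form (sum_g A_g)**r / r! given by the multinomial theorem."""
    return sum(A) ** r / factorial(r)


def conditional_row_probability(A, x):
    """P(x_v1, ..., x_vG | row total) for one subject; the ability B_v has cancelled."""
    r = sum(x)
    return prod(a ** c / factorial(c) for a, c in zip(A, x)) / gamma_closed_form(A, r)


A = [0.5, 1.2, 2.0]  # hypothetical grader harshnesses on the A_g = exp(eta_g) scale
for r in range(6):
    print(r, gamma_by_enumeration(A, r), gamma_closed_form(A, r))
```

Because the conditional probability of each subject's row is a multinomial probability with cell probabilities $A_g / \sum_g A_g$, no recursive evaluation of symmetric functions is needed in this model.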

NUMERICAL ANALYSIS PROBLEMS
ASSOCIATED WITH CONDITIONAL INFERENCE

It is one thing to produce a mathematically rigorous statement of
probabilistic models involving many parameters, and another thing to
devise efficient and accurate numerical methods to answer the kinds of
practical questions we pose about the operation of the models. To a
certain extent these arithmetic problems have curtailed the widespread use
of Rasch models among social scientists; on the other hand, their apparent
intractability has led others to adopt approximations the validity of
which is frequently unknown.

In the development of new techniques, however, initial caution must
eventually give way to guarded extension if the models are to have
acceptance in a wider sphere. Often this means that either assumptions are
relaxed or approximations are introduced. It is the latter option which
has found most favour with respect to Rasch models.

It would be naive to imply that the only remaining problems of latent
trait models are those associated with numerical methods. However, for
the current exercise, we will address ourselves to some of these problems
in the hope that, if the problems are well identified, their solution will
follow a little more easily. Normally we would not concern ourselves
with arithmetic problems like these at a seminar of this nature but, since
they are intricately related to the conditional inference argument, their
discussion is quite relevant, particularly when crude approximations are
used to verify the powerful properties of Rasch models. In a
characteristic understatement of Rasch (1960):

the practical applications of the theory present, however, difficulties yet
to be removed. As they are 'only' algebraical and computational, we
may hope for a satisfactory way out. (p. 122)

We turn now to a brief discussion of four of these problem areas.

The Symmetric Functions

The number of terms in a middle-order symmetric function for binary-item
tests of even moderate length, say 50 to 60 items, is astronomical;
for example, the symmetric function $\gamma_{25}(\delta)$ in a 50-item
test has in excess of $12.6 \times 10^{13}$ terms. Our computers are fast,
but there are limits. Even granting that these calculations could be made,
there still remains the problem that there is no closed explicit form for
a symmetric function. All algorithms which determine them, or ratios of
them, use a recursive expression which builds up each successive function
from those determined previously in the list. This obviously reduces the
actual number of calculations involved, but does introduce another more
damaging complication: rounding error.
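
The contrast between the number of terms and the cost of a recursive evaluation can be made concrete. This is a small sketch under the assumption that the recursion in question is the usual summation recursion, which adds one item at a time; the function name is hypothetical.

```python
from math import comb


def esf_with_count(eps):
    """Summation recursion for gamma_0..gamma_k, also counting multiply-add steps."""
    gamma = [1] + [0] * len(eps)
    steps = 0
    for j, e in enumerate(eps, start=1):
        for r in range(j, 0, -1):
            gamma[r] += e * gamma[r - 1]
            steps += 1
    return gamma, steps


k = 50
terms = comb(k, k // 2)                 # number of terms in gamma_25 for 50 items
gamma, steps = esf_with_count([1] * k)  # with every eps equal to 1, gamma_r counts terms
print(terms, gamma[25], steps)          # prints: 126410606437752 126410606437752 1275
```

The recursion reaches all 51 functions in 1275 multiply-add steps, against roughly $1.26 \times 10^{14}$ products for a term-by-term evaluation of the middle function alone.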

The calculation of the exponential function by a computer necessarily
means retaining only a finite number of significant figures in each
calculation. If only a few calculations are being made, by using double
precision arithmetic there is usually no problem about rounding error.
When determining symmetric functions of the order we are describing,
however, this rounding error can accumulate dramatically to the extent
that negative estimates of the functions arise persistently, causing the
estimation algorithm to abort. Hence there are no conditional estimates.
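
One way to see where rounding error enters is to compare floating-point results with exact rational arithmetic. The sketch below is illustrative and rests on assumptions not stated in the text: it takes the vulnerable recursion to be the difference form $\gamma_r^{(i)} = \gamma_r - \epsilon_i \gamma_{r-1}^{(i)}$, which subtracts nearly equal quantities, and it uses a hypothetical 40-item test.

```python
from fractions import Fraction
from math import exp


def esf(values, one):
    """Summation recursion; `one` selects float (1.0) or exact (Fraction(1)) arithmetic."""
    gamma = [one] + [one * 0] * len(values)
    for j, v in enumerate(values, start=1):
        for r in range(j, 0, -1):
            gamma[r] += v * gamma[r - 1]
    return gamma


def esf_without_item_by_difference(gamma, e):
    """gamma_r with one item removed, via gamma_r^(i) = gamma_r - e * gamma_{r-1}^(i)."""
    out = [1.0]
    for r in range(1, len(gamma) - 1):
        out.append(gamma[r] - e * out[r - 1])
    return out


def max_rel_error(approx, exact):
    """Largest relative error of the float values against the exact rationals."""
    return max(abs(Fraction(a) - x) / x for a, x in zip(approx, exact))


# Hypothetical 40-item test with difficulties spread evenly from -4 to +4 logits.
k = 40
eps = [exp(-(-4.0 + 8.0 * i / (k - 1))) for i in range(k)]
exact_eps = [Fraction(e) for e in eps]  # the same numbers, held exactly

gamma_float = esf(eps, 1.0)
gamma_exact = esf(exact_eps, Fraction(1))
err_sum = max_rel_error(gamma_float, gamma_exact)

# Remove the easiest item (largest eps) by differencing and compare with exact values.
removed_exact = esf(exact_eps[1:], Fraction(1))
removed_diff = esf_without_item_by_difference(gamma_float, eps[0])
err_diff = max_rel_error(removed_diff, removed_exact)

print(float(err_sum), float(err_diff), min(removed_diff))
```

The summation form only adds positive terms, so its relative error stays near machine precision. The difference form can lose many significant figures to cancellation, and a negative minimum in the last printed value would correspond to the negative 'estimates' of the functions described above.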

Gustafsson (1979) claims to have solved the rounding error problem by
utilizing a number of previously unused recursive relationships among
the successive symmetric functions. By employing these and a number of
other expedient devices, Gustafsson has written a program for which
conditional estimates can be determined for binary-item tests of up to
about 100 items, as long as not too many of the items are at the extremes
of the difficulty range. Unfortunately we do not have prior knowledge of
just how many or how extreme are the items in order to predict whether
or not the program will abort. More crucially, there will always be the
suspicion that, although rounding errors have started to 'set in', they are
not yet of such a magnitude to abort the program. What then of our
'estimates'? Even if the estimates are to be believed, we might still
question the effort involved.

Standard Errors 

Another numerical problem arises in connection with the algorithm used
to solve the c.m.l. equations. If the raison d'etre of the exercise was the
calculation of the c.m.l. estimates only, then there is little doubt that one
would prefer variations of the so-called 'switching' method to the
multi-parameter version of the Newton/Raphson method. Briefly the switching
method for binary item analysis involves solving for $\delta_i$ from the
equation which arises as a re-arrangement of the c.m.l. equation:



Although more iterations are required to achieve convergence with the
switching method, at least a k × k matrix of second derivatives does not
have to be inverted at each iteration, as is required with the
multi-parameter algorithm. But the purpose of the exercise is certainly
not just to estimate parameters; it should involve also the determination
of the standard errors of those estimates as well as tests of fit.
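
An inversion-free iteration of the kind described can be sketched as follows. The rearranged equation itself is not recoverable here, so this sketch assumes the rearrangement $\delta_i = \ln \sum_r n_r \gamma_{r-1,i}/\gamma_r - \ln s_i$, applied to one item at a time and followed by centring the difficulties at zero; whether this coincides in detail with the 'switching' method is an assumption, and the data are hypothetical.

```python
from math import exp, log


def esf(eps):
    """Elementary symmetric functions gamma_0..gamma_k by the summation recursion."""
    gamma = [1.0] + [0.0] * len(eps)
    for j, e in enumerate(eps, start=1):
        for r in range(j, 0, -1):
            gamma[r] += e * gamma[r - 1]
    return gamma


def expected_counts(delta, n):
    """Conditional expectation of each item count given the raw-score distribution n."""
    eps = [exp(-d) for d in delta]
    gamma = esf(eps)
    k = len(eps)
    out = []
    for i, e in enumerate(eps):
        gamma_i = esf(eps[:i] + eps[i + 1:])  # item i removed
        out.append(sum(n[r] * e * gamma_i[r - 1] / gamma[r] for r in range(1, k)))
    return out


def solve_one_item_at_a_time(s, n, sweeps=500):
    """Fixed-point iteration on the c.m.l. equations; no matrix is inverted."""
    k = len(s)
    delta = [0.0] * k
    for _ in range(sweeps):
        for i in range(k):
            # Equivalent to delta_i = ln(sum_r n_r gamma_{r-1,i} / gamma_r) - ln(s_i).
            expected_i = expected_counts(delta, n)[i]
            delta[i] += log(expected_i) - log(s[i])
        mean = sum(delta) / k  # identification constraint: difficulties sum to zero
        delta = [d - mean for d in delta]
    return delta


# Hypothetical data: 3 items, 10 subjects with score 1 and 10 with score 2.
s, n = [15, 10, 5], [0, 10, 10]
delta_hat = solve_one_item_at_a_time(s, n)
print(delta_hat, expected_counts(delta_hat, n))
```

The sketch delivers point estimates only. Standard errors would still require the matrix of second derivatives to be formed and inverted once after convergence.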

If we are to adhere to the principles of c.m.l. estimation, then the most 
appropriate standard errors will be given by the square roots of the 
diagonal elements of the matrix in (10). If we use the switching method 
and avoid inversion at each iteration, the inversion would still be 
necessary after convergence in order to extract the standard errors. The 
implied criticism by Gustafsson (1979) of unconditional procedures for 
parameter estimation is not levelled consistently since unconditional 
standard errors are used by Gustafsson in his programs. Elsewhere there 
is evidence, Douglas (1978), to suggest that conditional and un- 
conditional standard errors are not the same and, unlike the estimation 
of the parameters themselves, we do not have known correction factors 
which enable us to change from one form to the other. 

More Complex Models 

If the numerical problems are not wholly controlled in the binary item
analysis model, there is little surprise in finding that for more complex
Rasch models the situation is quite unclear. For polychotomous models
for rating, for example, little is known about how to arrive at the c.m.l. 
estimates (and their standard errors), other than in relatively simple cases 
where the structural parameters are few in number. Unfortunately the 
identification of this situation as a problem area is camouflaged by the 
fact that, in the published literature, most examples limit the number of 
structural parameters and one could be excused for believing that 
numerical problems are totally non-existent. Yet in practice we are likely 
to be dealing with large numbers of structural parameters. As we increase 
the number of response categories — and five categories is certainly not 
uncommon— the number of terms in each symmetric function also in- 
creases beyond what it would be for the binary case. Although correction 
factors are applied to unconditional item estimates in the polychotomous 
model of Andrich (1978a), no published studies are available as there are 
for the binary model (Wright and Douglas, 1977) which detail 
thoroughly the circumstances in which the corrected unconditional and 
the conditional estimates are similar. Naturally the problems of matrix 
inversion and the standard errors are also magnified in these models. 

The advisability of trying to find corrected estimates becomes more 
questionable the greater the number of sets of parameters in the model. 
For example, in a 3-facet model, the unconditional algorithm must
estimate simultaneously three sets of non-linear equations and must keep 
a check on not only the convergence within each set but also on the 
overall convergence to ensure that the complete likelihood is maximized. 
Once correction factors have been applied to one or more sets of 
parameters, the likelihood is no longer maximized. 

Estimating Other Sets of Parameters

One advantage of viewing these models in their most general form is that 
we are less likely to become fixated on item analysis to the detriment of 
subject analysis. The area of subject ability analysis in a number of 
models where it is relevant has received virtually no attention in the 
literature. If we are to follow the spirit of Rasch and our generic form, 
the subject ability parameters would be seen as structural parameters in 
the presence of the incidental item parameters when the items for a test 
have been selected from a bank or pool. In that case, the focus is on per- 
son measurement and we are at liberty to vary the number of items 
administered. 

All of the arguments used previously to derive conditional inference
statements with respect to item parameters may be employed in a directly
parallel manner to derive conditional inference expressions for the
subject ability parameters. Even though the number of subjects being
measured simultaneously by a test may be thought of as fixed, this
number could be quite large. Hence all numerical problems which we
have identified with item estimation are often intensified with conditional
subject estimation; one only has to note that, for items, there are
$\binom{k}{r}$ terms in the rth symmetric function but, for subjects,
there are $\binom{N}{s}$ terms.

While it is rare that we would wish to calibrate one or two items at a 
time, it is not uncommon to find contexts (mastery learning, criterion- 
referenced testing, tailored testing) where one person is to be measured at 
a time. This situation raises a whole series of conceptual as well as 
numerical problems when conditional inference is employed, since it is
patently impossible to estimate the ability of a single subject condition- 
ally without recourse to the ability of a reference subject — and hence we 
are back to norm-referenced measurement. It is beyond the scope of this 
paper to delve into these problems but the dilemma does highlight once 
again that the models of Rasch are models for comparisons and that ab- 
solutes really have no place here. Claims to the contrary are false. 

We may identify at least two procedures (in addition to the
conditional) which have been used to arrive at subject measurement in the
binary item analysis model. According to Andersen and Madsen (1977,
p. 359), 'the logical implication is to base the inference concerning the
$\beta_v$'s on the remaining part of the likelihood'. Since unconditional,
marginal, and conditional likelihoods are connected via the relationship

Andersen is advocating that the expression

be used to estimate the ability parameters. Although the symmetric
functions contain no $\beta$'s, the denominator does involve item
parameters as well as $\beta$'s and it is customary to replace the
$\delta_i$'s by their c.m.l. estimates. On the other hand, Wright and
Panchapakesan (1969) and Andrich (1978a) base the inference on the
unconditional likelihood. Although the likelihoods are different, both
approaches produce the same m.l. equations for the $\beta$'s and Wright
substitutes the corrected unconditional $\delta$'s rather than the
conditional ones. Given that for a very wide class of binary item analysis
examples the corrected unconditional item estimates are virtually
identical to the conditional ones, the approaches of Andersen and Wright
should coincide. However, since correction factors are not well
established for models other than the binary item analysis one, we should
not expect coincidence of the Andersen and Wright methods in other models.

We might note also that the inconsistency of the unconditional item
estimates, which leads to their values being exactly twice those of
the conditional estimates in the case of two items, is directly
duplicated when attention is turned to estimating the abilities of two
subjects. Although to our knowledge no confirmatory studies have been
carried out, we might surmise that an identical correction factor which
converts unconditional estimates to approximate conditional ones for
items,

$$\frac{k-1}{k},$$

also operates in converting unconditional estimates to approximate
conditional ones for subjects,

$$\frac{N-1}{N}.$$

Clearly this correction factor is not insignificant when we are measuring a
small number of subjects.
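
The two-item case can be verified directly. The sketch below is a worked check under stated assumptions: a binary Rasch model with success probability logistic($\beta_v - \delta_i$), hypothetical counts of the two informative response patterns, and the symmetry argument that every subject with a raw score of 1 receives an unconditional ability estimate midway between the two difficulties.

```python
from math import exp, isclose, log


def logistic(x):
    return 1.0 / (1.0 + exp(-x))


# Hypothetical two-item data: only subjects with a raw score of 1 are informative.
n10 = 30  # answered item 1 correctly and item 2 incorrectly
n01 = 10  # the reverse pattern
n = n10 + n01
k = 2

# Conditional estimate of d = delta_2 - delta_1: P(pattern 10 | score 1) = logistic(d).
d_cml = log(n10 / n01)

# Unconditional estimate: each score-1 subject has beta midway between the two
# difficulties, so the item equation becomes n10 = n * logistic(d / 2).
d_uml = 2.0 * log(n10 / n01)

print(d_cml, d_uml, d_uml * (k - 1) / k)
```

The unconditional difference is exactly twice the conditional one, and multiplying it by (k − 1)/k with k = 2 recovers the conditional value.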

With respect to the standard errors of subject ability (the equivalent of
what psychometricians would refer to as the precision of measurement),
both the Andersen and Wright methods lead to approximate expressions
not involving the inversion of matrices and once again we have little
information about whether these are under- or over-estimates of the error
of measurement.

SOME DIRECTIONS FOR THE FUTURE

There appear to be a number of possible ways out of the dilemmas of
numerical analysis as outlined in the previous section. The most
straightforward but uncompromising solution is to follow Gustafsson's
example and attempt to improve the algorithms for calculating directly
the symmetric functions for all models. We tend to doubt the advisability
of this action for models other than those on which it currently works
since the numerical problems are inordinately complex. It is not
uncommon, for example, to find oneself working with attitude
questionnaires of the Likert-type (with five response categories)
consisting of something like 50 questions. Since in this case raw scores
range from zero to two hundred, it is impossible to analyse these data
conditionally with the algorithms presently available. Other alternative
solutions must be sought.

An alternative which suggested itself to Rasch himself in the early
seventies (personal communication) was to find numerical approximations
to the symmetric functions along the lines of the formula known as
'Stirling's approximation' for higher order factorials. These
approximations would take the form of explicit expressions for symmetric
function ratios of all orders. Other than initial skirmishes with the
problem, little development appears to have taken place in this direction.

A potentially promising approach is offered in related work on exact
probability tests being carried out by Agresti and his colleagues (1977; 
1979). The approach offers advantages not only for parameter estimation 
via the likelihood equations but, more importantly, for carrying out 
exact tests of fit. Before outlining Agresti's approach, we should say 
something further about the tests of fit employed in Rasch models. 

The applicability of any model derivable from our generic form rests 
substantially on the assumption that the model fits the data to within ac- 
ceptable probability limits; in that case, all the properties of the model on 
which we place so much importance must follow necessarily. Viewed in 
this manner, the determination of fit precedes in importance the deter- 
mination of parameter estimates to the extent that an understanding of 
the psychological processes underlying the interactions, which give rise to 
our data, comes from our assumption that we have the correct model. To 
talk of Rasch models as 'providing specific objectivity' is to understand 
that these properties obtain in the presence of the model fitting the par- 
ticular data set. Without fit we really have very little to talk about. 

Complications occur when we realize that data fail to fit probability 
models for many reasons and that it is highly unlikely that we will find a 
statistical test which will detect lack of fit against all possible alternative 
models. A test which is suitable for detecting unequal item discrimina- 
tions in the binary item analysis model, for example, may have virtually 
zero power for detecting other departures from that model (i.e. a model 
with equal item discriminations but unequal person sensitivities). At the 
other extreme we have a problem which is constantly with us, that of 
sample size: if we manage to collect enough data pertinent to our model,
any test we use gains sufficient power eventually to reject the model
against every alternative hypothesis and we conclude that no data will
ever fit the model.

This is not the place for an extended discussion of the question of tests
of fit. Gustafsson (1977; 1979) has written extensively on this topic in
recent articles, where he raises some fundamental questions about the
power of the approximate chi-square tests of fit many of us are
accustomed to employing in our Rasch model programs. What concerns us
here is that one of the reasons we use these approximate tests (apparently
without knowledge of their statistical power) is that, despite our
awareness of the existence of the more powerful tests, the operation of
the former does not depend on the calculation of the complete likelihood
and consequently the symmetric functions.

There is no doubt that we do know the theoretically correct path to
follow. According to Andersen (1973b),

the main result so far on conditional inference is that a uniformly most
powerful unbiased (U.M.P.U.) test for a composite hypothesis can be
constructed from the conditional likelihood.

In practice this test takes the form of a likelihood ratio test in which the
statistic

$$2\left[\sum_{g=1}^{G} L_g - L\right]$$

is distributed as chi-square on (G − 1)(k − 1) degrees of freedom and
where

(i) $L$ is the log of the complete conditional likelihood as derived in (8),

(ii) $L_g$ is the log of the conditional likelihood for the gth subset of
the data, where G is chosen such that the number of observations in each
subset is sufficiently large to warrant the assumption of asymptotic
theory.

Although the combinatorial number C disappears, the item parameters
have to be estimated for G + 1 data sets and, of course, the conditional
estimates must be used.

There is also no doubt that in many instances we are simply not in a
position to apply this test, either because we are unable to calculate the
symmetric functions (and hence the likelihoods) or because the sample
size is so small that asymptotic theory is of dubious validity. The purpose
of highlighting this problem is not to exhort researchers to drop their
approximate tests of fit, but to induce a healthy scepticism and caution
when using these approximate tests with the anticipation that, when the
numerical analysis details are worked out, we will be able to operate the
conditional tests in all circumstances.

The implementation of exact probability tests, on the other hand,
requires no assumptions about distributional shape, parameter estimation,
or large sample sizes (Fisher, 1934). As we have noted in equation (6), an
exact test of fit of data to a Rasch model is theoretically possible since we
have a conditional probability statement completely free of all
parameters in the model. This enables us to control the model on the
basis of the observed quantities alone since no parameters have to be
estimated. Ideally we would calculate the probability of the observed
data, given the marginals (which we know to be equal to 1/C), and the
probability of each other possible data set with the same marginals (each
of which also happens to have probability 1/C) which tends to favour a
hypothesis which is an alternative to the null one of specific objectivity.
These other data sets are said to be 'more extreme'. The sum of all these
probabilities would then be compared with the probability of a Type I
error and conclusions about fit follow.

What prevents us going ahead as described above is the calculation of
the combinatorial number, C, the number of possible 0, 1, . . ., m data
matrices. A direct frontal attack on this number seems futile, even
though the answer is known when we do not restrict our observations to
a pre-determined maximum of m; however, what Agresti recommends,
in a related context, is a sampling of a relatively small number of the
large number of possible matrices. With a high-speed computer the
probability of a Type I error could be determined to any degree of
accuracy. Sampling of both symmetric functions and matrices appears a
possibility so that the technique might be employed for estimation as well
as testing. These ideas are in their formative stages only but they do
appear to offer one way out of the dilemma and will possibly pay strong
dividends for someone interested in starting an investigation along these
lines.
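
A sampling scheme of this general kind can be sketched for the binary case. The details of Agresti's procedure are not given here, so the sketch substitutes a different, well-known device: a Markov chain of 'checkerboard' swaps, which moves among 0/1 matrices with the same row and column totals. The data matrix and the fit statistic are hypothetical and chosen only for illustration.

```python
import random


def margins(m):
    """Row totals (raw scores) and column totals (item counts) of a 0/1 matrix."""
    return [sum(row) for row in m], [sum(col) for col in zip(*m)]


def swap_step(m, rng):
    """Attempt one checkerboard swap; the margins are unchanged either way."""
    r1, r2 = rng.sample(range(len(m)), 2)
    c1, c2 = rng.sample(range(len(m[0])), 2)
    # A swap is possible only when the 2 x 2 submatrix is [[1, 0], [0, 1]] or its mirror.
    if m[r1][c1] == m[r2][c2] and m[r1][c2] == m[r2][c1] and m[r1][c1] != m[r1][c2]:
        m[r1][c1], m[r1][c2] = m[r1][c2], m[r1][c1]
        m[r2][c1], m[r2][c2] = m[r2][c2], m[r2][c1]


def joint_successes(m):
    """Hypothetical fit statistic: subjects answering both of the first two items."""
    return sum(row[0] * row[1] for row in m)


def monte_carlo_p(observed, stat, draws=2000, thin=50, seed=0):
    """Estimated probability of a statistic at least as large as the observed one."""
    rng = random.Random(seed)
    m = [row[:] for row in observed]
    t_obs = stat(observed)
    hits = 0
    for _ in range(draws):
        for _ in range(thin):
            swap_step(m, rng)
        if stat(m) >= t_obs:
            hits += 1
    return (hits + 1) / (draws + 1)


observed = [
    [1, 1, 0, 0],
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 0, 1, 0],
    [0, 0, 0, 1],
]
print(margins(observed), monte_carlo_p(observed, joint_successes))
```

Since every matrix with the given margins has the same conditional probability 1/C under the model, the proportion of sampled matrices that are at least as extreme as the observed one estimates the exact probability without C ever being computed. The accuracy depends on the number of draws and on how well the chain mixes.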

By now the reader will have been prompted to ask the question, 'Why
use the corrected unconditional approach in both estimation and fit?' If it
were possible to find the appropriate correction factors for parameter
sets in all models, the problem of estimation would no longer be with us, 
even though we still see no way of getting around approximate expres- 
sions for standard errors. But this still leaves the tests of fit since 
Andersen's test requires the calculation of the log-symmetric functions. 

Whereas we must agree with Gustafsson's (1980) exhortation:
'whenever it is judged important that goodness of fit is evaluated with
sound methods, the c.m.l. approach should be used', we see no
disadvantages accruing from a strategy which takes the corrected
unconditional estimates (involving no calculation of symmetric functions)
and using them in the conditional likelihood (involving a single
calculation of the symmetric functions). Our stance is to make use of the
best of all available methodology to arrive at solutions whose rigour is
unquestioned. An increased awareness of the importance of fitting is
certainly an encouraging sign in an area prone to ad hoc approximations.
Furthermore an emphasis on questions of person fit is equally timely and
opens up possibilities only previously hinted at (Leunbach, 1976; Wright
and Stone, 1979).

CONCLUSION

My aim has been to review the central place of conditional inference in
the theoretical and practical operation of a class of latent trait models
which we label as Rasch models. The pedagogical stance has been one of
recognition of the technically correct procedures to adopt both in
parameter estimation and hypothesis testing about fit, combined with a
cautious use of numerical approximations where applicable.

A number of avenues have been hinted at for the future direction of
numerical analysis problems, all of which approximate the conditional
algorithms. I have stressed the crucial aspects of tests of fit. In particular,
I hope that those using approximate tests will temper their claims for
'good fit' with statements which acknowledge two fundamental facts: 'fit'
is never fully determined by a finite set of tests; and information on the
power of tests adds credibility to such claims.

REACTANT STATEMENT

Alan G. Smith



Dr Douglas's paper is a valuable one on several counts. Firstly, it makes a
contribution towards a generalization of the model, in bringing together
Rasch developments in several areas; and, in doing so, the paper addresses
the problems of goodness of fit tests and the power of such tests.
Secondly, the paper reminds us of some essential features for 'specific
objectivity' in the measurement of attributes and, in particular, reminds
us why we should restrict ourselves to only one item parameter. Thirdly,
the paper reminds us of the real complexity of modelling psychological
concepts such that we have both mathematical completeness and at the same
time workable formulae to apply to real data. I wish to take up the last
point briefly, and to make one or two other comments.

Despite the great theoretical attractions of the procedure, Dr Douglas
has not overstated the difficulties inherent in utilizing symmetric
functions with the model. This raises for me the general issue of the
potential gulf between theoretical development and real-world
applications of a statistic. The Norton studies of the behaviour of the F
statistic (Lindquist, 1953) many years ago showed that that statistic
often provides good information under conditions where it could be
expected to fail; again, colleagues will be very familiar with the
robustness of the Pearson product-moment coefficient with data which often
greatly abuse its assumptions. Similarly, in the case of the Rasch model
applied to binary item analysis as put forward by Wright and Panchapakesan
(1969), we find that several of the theoretical problems with the model
may not be as significant as Dr Douglas's paper would suggest. Evidence is
accumulating that the assumption of uniform item discrimination is not
nearly as vital to the performance of the model as writers such as Whitely
and Dawis (1974) would have us believe (Dinero and Haertel, 1977;
Smith, 1978). It would appear, too, that practical use of the model in
person-ability estimation is not significantly affected by the results of
goodness of fit tests (Smith, 1978). Again, we find that the model works
well with quite small samples (Tinsley and Dawis, 1975) even though it
uses a very large sample statistic (Whitely and Dawis, 1974). Finally,
although Wright and Douglas (1977) show that the 1969 Wright and
Panchapakesan maximum likelihood procedures are biased, they also show
that the extent of the bias is of minimal practical importance, especially
when compared with the alternative estimation procedure. Thus the
merits of new computationally complex procedures which solve
theoretical problems of goodness of fit and so on, albeit rather nicely,
will have to be very clearly demonstrated; this is especially true given
that it has taken 20 years for the Rasch model to reach its current state
of limited acceptance.

The work of developing the model and spreading its good measurement
news is also hampered by problems of terminology and communication. I am
not a mathematician and am uncertain of the correctness of my
understanding on this point, but there seems to be a real problem in the
use of 'conditional' versus 'unconditional' estimation procedures here.
Douglas's use of the term, which is fully explained in the 1977 paper with
Wright, is quite different from the earlier use of these terms in the
latent trait field. The originators of the terms were Bock and Lieberman
(1970) and, if I understand their terms, I would conclude with Baker
(1977) and Subkoviak and Baker (1977) that Wright and Panchapakesan's
1969 procedure is conditional, not unconditional, and so we have a
terminology problem wherein one might say Douglas has presented us with an
unconditional generic model rather than a conditional one.

Although Douglas covers himself well when he says he would not wish
to restrict work on other probabilistic models, it is difficult perhaps to
see attributes elsewhere from the standpoint of Rasch assumptions. Now it
is true that the two-parameter normal ogive item analysis model does not
demonstrate 'specific objectivity' as defined mathematically by Rasch,
and it would appear to be true that invariance of parameters does not
exist for the two-parameter model (Smith, 1975; Baker, 1977). This does
not mean it is not useful, nor that it cannot be made to work. Douglas
says that in the two-parameter case, we must retain people's original
response data in order to know something about their ability, and that is
true; but I think we can discover more than the limited raw score to
which Douglas suggests we are limited. I have been doing some work
with the normal ogive model involving iterative conditional estimation of
the two item parameters in the first step, and person-ability in the second
step, and there is evidence that useful results emerge. It must also be said
that it is a complex expensive process which does not compare well with
the Rasch logistic model. There is the further point, however, that the
normal ogive model offers through the use of the normal curve a link
with psychological theory which is most attractive.

I have two other brief comments. The first pertains to the notion of
person-fit to the model. I must confess to some abhorrence of this
concept, depending on how it comes to be used. Our main purpose in
educational measurement is to be able to make definitive statements about
relative person ability. While the concept of person-fit is statistically
nice, given model parameters, people are paramount and the model must
accommodate them, within populations. Therein is the problem: the
definition of populations must be broad rather than narrow, and
researchers must be careful about conclusions drawn from significant
person-fit tests. The second comment is related to the first in the notion
of person-ability. Classical test theory is substantially concerned with
unidimensional abilities, and we spend much of our time establishing
reliable sub-scales in major tests. The fact is, of course, that latent trait
models are similarly concerned with unidimensional abilities,
notwithstanding their other merits. One of the major practical problems
which remains with latent trait models is to show how person-ability
values derived from several test scales can be related and treated, and
whether the models can be made to work with tests which measure complex
abilities. Wide use of the Rasch model, for example, will depend on
such attributes being clearly demonstrated, and fortunately evidence
(e.g. Smith, 1975) is encouraging in this regard.

In conclusion, while I have been provocative about several aspects of
Dr Douglas's paper, one does need to note his comments about
requirements for Rasch goodness of fit tests and their power. His paper
will no doubt prove to be constructive in the further development of the
model.

The Use of Latent Trait Models
in the Development and Analysis
of Classroom Tests



John F. Izard and John D. White



INTRODUCTION 



Teachers use tests for a number of different but related purposes. These
include assessing the success of a relatively short sequence of instruction, 
indicating how much knowledge or skill has been retained over a period 
of time, describing or summarizing achievement over an extended period 
of study, and diagnosing aspects of curriculum which need further in- 
struction. Obviously, if curricula vary, then the supply of tests which 
mirror each curriculum presents problems. 

When using published tests to assess progress through an instructional 
sequence, teachers may be concerned that some of the questions are of 
limited value because the content differs from the material presented in
their own classes, or that certain important objectives have been given 
little consideration in the test specification. Such concerns may result in 
teachers rejecting the use of published tests, making do with inadequate 
data, or using other questions in an unsystematic way in an attempt to 
meet deficiencies in the published tests. 

In order to meet these concerns, some teachers have been using collec- 
tions of test questions such as the Australian Item Bank (Year 10 
Mathematics, Science, and Social Science) and the New Zealand Item 
Bank: Mathematics (Levels 2-7 Mathematics). However, selection of
questions on the basis of content alone and without consideration of
other characteristics such as difficulty and discrimination makes the
interpretation of the results obtained from such questions difficult and
of doubtful validity.

Since variations in curricula place different emphases on different
objectives, it seems desirable to produce collections of questions for each
objective so that tests can be tailored to suit local needs. Provision of
such collections will not represent any advance on the existing item banks
unless difficulty data are supplied with the items and unless procedures
are developed to adjust for difficulty when interpreting scores. Such
procedures will need to be easy to apply without reference to computing
centres and will need to allow for additional questions to be added to the
bank where necessary. In Australia, data collected in the development of
each item bank are not readily available to users (except in Tasmania).
This may lead to a situation where two sets of five questions may be used
and the achievement level represented by a score of four or more on one
test may be totally different from that for the same score on the other
test.

Test analysis techniques based on the Rasch model (Rasch, 1960;
Wright and Stone, 1979) attempt a separation of the respective
contributions of person ability and item difficulty to a score. When this
latent trait model is assumed to be appropriate, item difficulty
information available for each question selected for a test may be used to
make ability estimates for various scores on that test.

This paper describes the development of a pool of calibrated items for
use by teachers and then presents a number of simplified procedures
which may be applied by teachers to construct tests from the item pool,
to interpret the results, and to calibrate further items devised by teachers.



DEVELOPMENT OF AN ITEM POOL

A solution to the problem of providing appropriate testing procedures
for schools seems to lie in the use of item banking techniques. This is a
long-term solution and may take the next decade to be implemented as a
working assessment procedure at the school level. An intermediate
solution lies in the development of a pool of test items and the production
of progress and review tests from this item pool. The essential feature of
these tests is that they relate to well-defined objectives and are used to
determine whether or not these objectives have been met or the extent to
which they have been met.

We do not make any assumption about the sequence in which skills are
taught or the curriculum in which the skills are embedded. We do assume
that skills can be taught and/or learnt. We do assume that a teacher
knows what he wants learnt by his students and that he has developed a
sequence of learning experiences to enable the skill to be acquired. In
other words, we assume that there are some identifiable skills which can
be taught, learnt, and tested as part of instructional programs utilized by
teachers. In the discussion which follows, we distinguish between
progress and review tests and then describe the development of an item
pool using addition of whole numbers as the specific topic.

A progress test is a small collection of items measuring performance
on a specific skill such that a score on this test reflects the mastery status
of a student relevant to the skill. For example, one progress test may
have a sample of items which involve adding two 2-digit numbers
without regrouping (carrying), while another progress test may involve
adding two 2-digit numbers with regrouping (carrying) from the units.

A review test is a collection of items measuring a student's
performance on a number of skills related by content such that a score on
this test identifies areas of strength and weakness possessed by the
student in the specified skill areas. For example, a review test may have
items involving adding two, three, and four 2-digit numbers without and
with regrouping (carrying).

In developing the item pool for the addition of whole numbers, the
objectives were first discussed with teachers from several education
departments. The collection of items was then trial tested with children in
Years [illegible], 4, and 6. The data presented in this paper are taken from
responses by the Victorian children in the sample, and were analysed with
Version 3 of the BICAL computer program (Wright, Mead, and Bell, 1979).



A teacher will have some idea of the target population for which a test is
selected; it may be that teaching material for the objective has been
completed recently or a placement or review test may have been
administered. We also know from trial testing of the items that we obtain a
number of items for each objective which cluster within a restricted
range.

If a learning program based on the specific objective has been designed
and implemented, there are probably reasonable expectations for the
program's success. These expectations would be reflected in a narrow
distribution of scores with a relatively high mean, and are represented in
Figure 1.

When comparing the target population's ability distribution and the
item difficulty distribution for the test or tests, it is convenient to use the
terminology suggested by Wright and Douglas (1975). The average
difficulty of the items selected for the test is referred to as the height of
the test, H. The range of item difficulties is the test width, W, and the
length of the test is the number of items, L. Where these are estimated
from samples, lower case letters are used.

The best overall test is the uniform test (Wright and Stone, 1979)
in which items are evenly spaced from easiest to hardest. This test is
appropriate for any target population within the usable range of the test.

Figure 1 Expected Pattern of Scores after Completion of a Successful
Learning Program (student ability and item difficulty plotted against the
range of the test)

The test is described in terms of H, W, and L and is designed with the
object of minimizing the standard error of measurement (SEM) subject to
certain constraints.

Consider an n-item uniform test as shown in Figure 2 where n is odd.

In the case illustrated in Figure 2, the length of the test is n (5 in this
example), the width of the test is (n - 1)d, and the height of the test is 0.

Figure 3 shows the corresponding information for an n-item uniform
test where n is even. In this example, the length of the test is n (4 in this
example), the width of the test is (n - 1)d, and the height of the test is 0.

If a student has a raw score of r on an n-item test then $b_r$, the ability
estimate of the student, is related to r by the equation

$$ r = \sum_{i=1}^{n} \frac{e^{b_r - d_i}}{1 + e^{b_r - d_i}} $$

where $d_i$ is the difficulty of item i.

Table 1 Limiting Values for Ability Estimates ($\lim_{d \to 0} b_r$)

  n      r=1      r=2      r=3      r=4      r=5      r=6
  3    -0.693    0.693
  4    -1.099    0.000    1.099
  5    -1.386   -0.405    0.405    1.386
  6    -1.609   -0.693    0.000    0.693    1.609
  7    -1.792   -0.916   -0.288    0.288    0.916    1.792

As $d \to 0$ the uniform test becomes more narrow and $d_i \to 0$; then

$$ r \to \frac{n e^{b_r}}{1 + e^{b_r}} $$

and

$$ \lim_{d \to 0} b_r = \ln \frac{r}{n - r}, $$

as shown for various values of n and r in Table 1.

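The standard error of $b_r$ is given by

$$ s_r = \left[ \sum_{i=1}^{n} \frac{e^{b_r - d_i}}{\left(1 + e^{b_r - d_i}\right)^2} \right]^{-1/2} $$

As $d \to 0$, $d_i \to 0$ and the standard error

$$ s_r \to \left[ \frac{n e^{b_r}}{\left(1 + e^{b_r}\right)^2} \right]^{-1/2}, $$

that is,

$$ \lim_{d \to 0} s_r = \sqrt{\frac{n}{r(n - r)}}, $$

as shown for various values of n and r in Table 2.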

Table 2 Standard Errors for Narrow Tests as d → 0 ($\lim_{d \to 0} s_r$)

  n     r=1     r=2     r=3     r=4     r=5     r=6
  3    1.22    1.22
  4    1.15    1.00    1.15
  5    1.12    0.91    0.91    1.12
  6    1.10    0.87    0.82    0.87    1.10
  7    1.08    0.84    0.76    0.76    0.84    1.08

Figure 4 shows the ability estimate obtained for raw scores on tests of
various lengths; each ability estimate is shown with one standard error
bounding either side of the estimate.

Figure 4 Ability Estimates for Various Raw Scores on a Number of
Uniform Narrow Tests

If the Rasch model is appropriate, we can specify a probability that a
person with a given ability will get a question correct. Similarly we can
estimate people's ability from the score they receive on a set of questions
reflecting a continuum. For a uniform narrow test, this ability estimate
can be expressed in terms of the number of standard errors above the
mean difficulty of the test. For example from Tables 1 and 2, the ability
estimate for a raw score of 4 on a test of length L = 5 (where h = 0) is 1.39
with a standard error of 1.12. This estimate is 1.24 standard errors above
the mean. By making some normal curve assumptions, we can infer that
there is a probability of 0.89 that the student has an ability greater than
the test mean. This inference may be used as a definition of mastery for a
narrow uniform test. In other words, a score corresponding to an ability
sufficiently higher than the mean may be regarded as evidence that items
from the same domain will be answered successfully.
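The narrow-test limits and the mastery inference can be checked with a short calculation. The following is a minimal sketch, assuming a zero-width test of height 0; the function names are illustrative and do not come from the paper, and the normal-curve step is the same assumption the text makes.

```python
import math

def narrow_test_limits(n, r):
    """Limiting ability estimate and standard error for a zero-width n-item test."""
    if not 0 < r < n:
        raise ValueError("no finite estimate exists for a zero or perfect score")
    ability = math.log(r / (n - r))          # lim b_r = ln(r / (n - r))
    std_err = math.sqrt(n / (r * (n - r)))   # lim s_r = sqrt(n / (r (n - r)))
    return ability, std_err

def normal_cdf(z):
    """Standard normal cumulative distribution, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Raw score of 4 on a 5-item narrow test with height 0 (the worked example).
ability, std_err = narrow_test_limits(5, 4)
z = ability / std_err                 # standard errors above the test mean
p_above_mean = normal_cdf(z)          # normal-curve assumption
print(f"b = {ability:.2f}, s = {std_err:.2f}, z = {z:.2f}, p = {p_above_mean:.2f}")
```

Looping `narrow_test_limits` over n = 3 to 7 and r = 1 to n - 1 should regenerate the entries of Tables 1 and 2.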

Our analysis shows that for tests of length less than 5 the raw scores
other than 0 or n give insufficient indication of mastery or non-mastery as
defined. For a narrow uniform test of length 5 we have mastery for
r = 4, 5, non-mastery for r = 0, 1 and there is no indication of the
respondent's mastery status for r = 2, 3.

If a test of length 5 is considered satisfactory, then tests with n = 6 or 7
might be considered wasteful. Given the constraint that only discrete
scores are possible, only limited additional information is available when
the test is lengthened by one or two questions.

We can now look for values of d for which the uniform test can be
described as narrow. In the uniform test design for this section, test
width

$$ w = (n - 1)d, \qquad f = \frac{r}{n}, $$

where r is the raw score, and ability estimates for given f and w are
obtained from the UFORM procedure (Wright and Stone, 1979, p. 144).

That is,

$$ b_f = w(f - 0.5) + \ln \frac{1 - e^{-wf}}{1 - e^{-w(1 - f)}}. $$

Hence

$$ b_f = d(n - 1)(f - 0.5) + \ln \frac{1 - e^{-d(n-1)f}}{1 - e^{-d(n-1)(1-f)}} $$

is an ability estimate for d > 0.

For example, given that n = 5 and r = 1, $b_{f,d \to 0} = -1.39$, and d is
found so that the discrepancy between $b_f$ and $b_{f,d \to 0}$, given by
f(d), is less than $\epsilon$, where $\epsilon$ is a designated accuracy value:

$$ f(d) = \left| -1.2d + \ln \frac{1 - e^{-0.8d}}{1 - e^{-3.2d}} - (-1.39) \right| < \epsilon. $$

Table 3 shows this discrepancy or error term f(d) for n = 5, r = 1 when
d > 0. Similar values for f(d) will be obtained when n = 5 and r = 4.

Table 3 Values of the Error Term f(d) for Values of d when n = 5, r = 1

   d      f(d)
  0.1    0.0003
  0.2    0.012
  0.3    0.032
  0.4    0.059

With d = 0.3, n = 5, w = 1.2, the ability estimate changes in magnitude
by 0.03 from the ability estimate derived from a narrow test.

This represents a 2.2 per cent change in the ability estimate or 2.7 per
cent of the standard error. Similarly, with d = 0.4 the percentage change
is 4.2 per cent or 5.3 per cent of the standard error for a test of zero
width.
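The error term can be evaluated directly for any item spacing d. The sketch below assumes a test of height 0 and uses the rounded narrow-test value -1.39 as the reference point, which is what appears to reproduce the tabulated figures; the function names are illustrative only.

```python
import math

def uform_ability(h, w, f):
    """UFORM ability estimate for height h, width w and relative score f."""
    a = 1 - math.exp(-w * f)
    b = 1 - math.exp(-w * (1 - f))
    return h + w * (f - 0.5) + math.log(a / b)

def error_term(d, n=5, r=1, narrow_estimate=-1.39):
    """Discrepancy between the UFORM estimate for spacing d and the narrow-test value."""
    w = (n - 1) * d   # width of an n-item uniform test with spacing d
    return abs(uform_ability(0.0, w, r / n) - narrow_estimate)

for d in (0.1, 0.2, 0.3, 0.4):
    print(f"d = {d:.1f}   f(d) = {error_term(d):.4f}")
```

A teacher could use the same function to find the largest spacing for which the discrepancy stays below a chosen accuracy value.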

If we have a cluster of items across a range of about 1.5 logits, then we
can select five of these items to construct a narrow uniform test. If these
items are specific to an objective then a progress test is constructed with
the following properties: (a) a student has mastered the skill if he scores 5
or 4; (b) he has not mastered the skill if he scores 0 or 1; (c) his mastery
status is not determined for scores of 2 or 3.

We will summarize the design so far by looking at data from an actual
test which consists of 20 items constructed to assess the objective of the 
addition of two 2-digit addends with a sum less than 100 and regrouping 
(carrying) from units to tens. 

Figure 5 shows the difficulty estimates obtained for each item when
these 20 items were calibrated with other addition items. The mean item
difficulty for all 20 items is marked on the figure and the magnitude of the
standard deviation of the difficulty estimates is shown as the dark line.

Figure 5 Item Difficulties for each Item on One Addition Test

We can construct a number of progress tests by selecting items from 
the pool, two of which are: 



Progress Test A:

  Item          35 + ?    57 + 26   58 + 12   66 + 18   37 + 15
  Difficulty    -0.47     -0.35     -0.23     -0.12      0.03

Progress Test B:

  Item          69 + ?    51 + 39   19 + 35   39 + 36   58 + 38
  Difficulty    -0.61     -0.47     -0.35     -0.17      0.03



For Progress Test A,

$$ h = \frac{\sum d_i}{L} = -0.23, $$

and

$$ w = 3.5 \sqrt{\frac{\sum (d_i - h)^2}{L - 1}} = 0.68. $$

Using Table 1, the ability estimates for r = 1, 2, 3, and 4 are:

  r    Ability estimate
  1    -0.23 - 1.39 = -1.62
  2    -0.23 - 0.41 = -0.64
  3    -0.23 + 0.41 =  0.18
  4    -0.23 + 1.39 =  1.16

with standard errors of 1.12, 0.91, 0.91, and 1.12 respectively, using
Table 2.
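The height and width of a progress test follow from the item difficulties alone. The sketch below is a hedged illustration: the signs of the five difficulties for Progress Test A are an assumption made here (the first four taken as negative), chosen because they reproduce the reported height of -0.23 and width of 0.68. The function name is illustrative.

```python
import math

def height_and_width(difficulties):
    """Test height (mean difficulty) and width (3.5 times the standard deviation)."""
    length = len(difficulties)
    h = sum(difficulties) / length
    variance = sum((d - h) ** 2 for d in difficulties) / (length - 1)
    return h, 3.5 * math.sqrt(variance)

# Assumed signed difficulties for Progress Test A (see the note above).
test_a = [-0.47, -0.35, -0.23, -0.12, 0.03]
h, w = height_and_width(test_a)
print(f"h = {h:.2f}, w = {w:.2f}")

# Narrow-test approximation: rounded height plus the Table 1 limit ln(r / (n - r)).
n = len(test_a)
approx = {r: round(h, 2) + math.log(r / (n - r)) for r in range(1, n)}
for r, b in approx.items():
    print(f"r = {r}: b = {b:.2f}")
```

The height is rounded to two decimals before the Table 1 limits are added, mirroring the hand calculation in the text.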

In the progress test model, these values are not so important because 
the user of the progress test is interested in the mastery status determined 
from the raw score. 

The user knows that r = 0, 1 corresponds to non-mastery requiring
further teaching, r = 4, 5 corresponds to mastery, and r = 2, 3 will require
further testing to confirm mastery status.

However, for the purposes of this discussion, the ability associated
with a raw score may be estimated by the UFORM procedure (Wright
and Stone, 1979, p. 144).

This ability estimate is given by the equation

$$ b_f = h + w(f - 0.5) + \ln \frac{1 - e^{-wf}}{1 - e^{-w(1-f)}} $$

with standard error given by

$$ s_f = \left[ \frac{w}{L} \cdot \frac{1 - e^{-w}}{\left(1 - e^{-wf}\right)\left(1 - e^{-w(1-f)}\right)} \right]^{1/2} $$

A comparison of the ability estimates with the approximations
resulting from narrow uniform test assumptions is presented in Table 4.

Table 4 Comparison of Estimates of Ability Using UFORM and
Narrow Uniform Test Assumptions

         Approximate estimate      UFORM estimate
   f       b_f       s_f            b_f       s_f
  0.2    -1.62      1.12          -1.63      1.12
  0.4    -0.64      0.91          -0.64      0.92
  0.6     0.18      0.91           0.18      0.92
  0.8     1.16      1.12           1.17      1.12
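The UFORM columns of Table 4 can be reproduced from the height, width, and length of Progress Test A. This is a minimal sketch of the two UFORM expressions as they are stated in this paper, using the rounded values h = -0.23 and w = 0.68; the function name `uform` is an illustrative choice.

```python
import math

def uform(h, w, length, r):
    """UFORM ability estimate and standard error for raw score r on a uniform test."""
    f = r / length
    a = 1 - math.exp(-w * f)
    b = 1 - math.exp(-w * (1 - f))
    c = 1 - math.exp(-w)
    ability = h + w * (f - 0.5) + math.log(a / b)
    std_err = math.sqrt((w / length) * c / (a * b))
    return ability, std_err

# Progress Test A: h = -0.23, w = 0.68, L = 5.
table_4 = {r: uform(-0.23, 0.68, 5, r) for r in range(1, 5)}
for r, (ability, std_err) in table_4.items():
    print(f"f = {r / 5:.1f}   b = {ability:6.2f}   s = {std_err:.2f}")
```

Zero and perfect scores are excluded because the logarithm is undefined when f is 0 or 1.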

Figure 6 illustrates the mastery status associated with a raw score of 4
and the non-mastery status for a raw score of 1. It also illustrates that
there is insufficient information to determine the mastery status for r = 2
or r = 3.

For a narrow uniform test we assume that all items have equal
difficulty. Where L = 5, and h = -0.23, the ability estimate for r = 4 is
$b_4 = 1.17$. Hence

$$ p(x = 1 \mid b_4 = 1.17, d = -0.23) = \frac{e^{1.40}}{1 + e^{1.40}} = 0.802 $$

is the probability of a success on an item encounter for a student with
ability 1.17. Similarly,

$$ p(x = 0 \mid b_4 = 1.17, d = -0.23) = \frac{1}{1 + e^{1.40}} = 0.198 $$

is the probability of a failure on an item encounter for a student with
ability 1.17. The probabilities of the student obtaining various raw scores
given that his ability is 1.17 are

  p(r = 5) = (0.802)^5             = 0.332
  p(r = 4) = 5(0.802)^4 (0.198)    = 0.410
  p(r = 3) = 10(0.802)^3 (0.198)^2 = 0.202
  p(r = 2) = 10(0.802)^2 (0.198)^3 = 0.050
  p(r = 1) = 5(0.802)(0.198)^4     = 0.006
  p(r = 0) = (0.198)^5             = 0.0003

Figure 6 Ability Estimates from Raw Score in Relation to Progress
Test A

When a student has an ability of 1.17, the maximum probability of
making an incorrect statement of his mastery status is therefore

  0.202 + 0.050 + 0.006 + 0.0003 = 0.258

This estimate is inflated due to the inclusion of the cases where r = 0
and r = 1.
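The raw-score probabilities follow a binomial distribution once all five items are treated as having the same difficulty. The sketch below recomputes them under that equal-difficulty assumption; the function names are illustrative and `math.comb` requires Python 3.8 or later.

```python
from math import comb, exp

def p_correct(ability, difficulty):
    """Rasch probability of success on a single item."""
    return exp(ability - difficulty) / (1 + exp(ability - difficulty))

def score_distribution(ability, difficulty, n):
    """Binomial probabilities of raw scores 0..n on n equally difficult items."""
    p = p_correct(ability, difficulty)
    return [comb(n, r) * p ** r * (1 - p) ** (n - r) for r in range(n + 1)]

p = p_correct(1.17, -0.23)
dist = score_distribution(1.17, -0.23, 5)
# Scores of 0 to 3 would not lead to a statement of mastery.
p_not_mastery = sum(dist[:4])
print(f"p = {p:.3f}")
for r in range(5, -1, -1):
    print(f"p(r = {r}) = {dist[r]:.4f}")
print(f"p(r <= 3) = {p_not_mastery:.3f}")
```

Restricting the sum to `dist[2:4]` gives the probability of an undetermined status alone, which is the part not affected by the inflation the text mentions.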

DESIGN OF REVIEW TESTS

In the case of the progress tests, the items for each test had a relatively
narrow range of difficulty. However if review tests are to be constructed by
selecting items from the various progress test item pools, the difficulty
continuum for addition items ranges from -4 to +4. It is possible to
design several review tests to span the continuum as shown in Figure 7.

Figure 7 Ability Ranges to be Covered by Review Tests (continuum
marked from -4 to +4)

We can consider one such test where h = -2.5, w = 3.0, and L is to be
determined.

For the progress tests we were able to use the characteristics of a
narrow test to determine the value of L.

In the review tests we cannot assume 'narrowness' and will have to
determine L from an assumption about the magnitude of the standard
error of measurement, using a method proposed by Wright and Stone
(1979, p. 140) in which

$$ SEM^2 = \frac{C_f}{L} $$

where $C_f$ is termed the error coefficient, SEM the standard error of
measurement and f the expected relative score. They define the error
coefficient as

$$ C_f = \frac{w\left(1 - e^{-w}\right)}{\left(1 - e^{-fw}\right)\left(1 - e^{-(1-f)w}\right)} $$

For the design above we obtain

$$ C_f = \frac{3\left(1 - e^{-3}\right)}{\left(1 - e^{-3f}\right)\left(1 - e^{-3(1-f)}\right)} $$



Table 5 Values of C_f for w = 3 and f = 0.1 (0.1) 0.9

   f      C_f
  0.1    11.8
  0.2     6.9
  0.3     5.5
  0.4     4.9
  0.5     4.7
  0.6     4.9
  0.7     5.5
  0.8     6.9
  0.9    11.8



Table 6 Test Length Associated with Values of SEM and C_f when
w = 3

                     Test length L for C_f =
  SEM    SEM^2     11.8     6.9     5.5     4.9     4.7
  0.1    0.01      1180     690     550     490     470
  0.2    0.04       295     173     138     123     118
  0.3    0.09       131      77      61      54      52
  0.4    0.16        74      43      34      31      29
  0.5    0.25        47      28      22      20      19
  0.6    0.36        33      19      15      14      13
  0.7    0.49        24      14      11      10      10
  0.8    0.64        18      11       9       8       7
  0.9    0.81        15       9       7       6       6
  1.0    1.00        12       7       6       5       5
  1.1    1.21        10       6       5       4       4
  1.2    1.44         8       5       4       3       3
  1.3    1.69         7       4       3       3       3
  1.4    1.96         6       4       3       3       2
  1.5    2.25         5       3       2       2       2

Table 7 Item Difficulties Generated for Review Test 1

  Item   Difficulty       Item   Difficulty
    1      -3.9             9      -2.3
    2      -3.7            10      -2.1
    3      -3.5            11      -1.9
    4      -3.3            12      -1.7
    5      -3.1            13      -1.5
    6      -2.9            14      -1.3
    7      -2.7            15      -1.1
    8      -2.5



Values for $C_f$ have been tabulated for various values of f as shown in
Table 5, and the corresponding test lengths for various values of SEM
are shown in Table 6.
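The error coefficient and the test length it implies can be computed for any width, relative score, and target SEM. The following is a minimal sketch of the relation SEM squared equals the error coefficient divided by L; the function names are illustrative, and rounding the length to the nearest whole item is an assumption that appears consistent with the tabulated values.

```python
import math

def error_coefficient(w, f):
    """Error coefficient C_f for test width w and expected relative score f."""
    return w * (1 - math.exp(-w)) / (
        (1 - math.exp(-f * w)) * (1 - math.exp(-(1 - f) * w))
    )

def required_length(w, f, sem):
    """Test length implied by SEM^2 = C_f / L (not yet rounded to whole items)."""
    return error_coefficient(w, f) / sem ** 2

for f in (0.1, 0.2, 0.3, 0.4, 0.5):
    print(f"f = {f:.1f}   C_f = {error_coefficient(3.0, f):.1f}")
print(round(required_length(3.0, 0.1, 0.9)))   # test length for SEM = 0.9 at f = 0.1
```

Because the coefficient is symmetric in f and 1 - f, only relative scores up to 0.5 need to be tabulated.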

A test length of L = 15 would not be an unreasonable length for a
review test in terms of the time for administration. If we look at L = 15 in
the body of Table 6 we see that SEM ranges from 0.9 (for f = 0.1, and
f = 0.9) to SEM = 0.6.

The formula $\delta_i = H - (W/2L)(L + 1 - 2i)$ for i = 1, ..., 15 generates
the preferred item difficulties for the 15-item test, as shown in Table 7.
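The preferred difficulties can be generated in one line from the design values. This is a small sketch of the formula as stated, with H = -2.5, W = 3.0 and L = 15; the function name is an illustrative choice.

```python
def uniform_difficulties(height, width, length):
    """Evenly spaced target difficulties for a uniform test of the given design."""
    step = width / (2 * length)
    return [height - step * (length + 1 - 2 * i) for i in range(1, length + 1)]

targets = uniform_difficulties(-2.5, 3.0, 15)
print([round(d, 1) for d in targets])
```

Adjacent targets are W/L = 0.2 logits apart, so the first and last targets sit half a step inside the nominal ends of the test width.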

We now have to decide on the items to include in a review test. Table 8
shows the overall review test structure which could completely span the
addition continuum by including items from a number of objectives
pools. The selection is shown in Table 9; the total deviation from desired
difficulties is 0.00.

Table 8 Review Test Design to Cover Addition Continuum

  Review Test    Objectives (codes)
       1         31, 32, 33
       2         33, 34, 35
       3         35, 36, 37
       4         37, 38, 39
       5         39, 40, 41
       6         41, 42, 43

We are now in a position to calculate the characteristics of review
test 1.

Test height is estimated by:

$$ h = \frac{\sum d_i}{L} = -2.5 $$

Test width is estimated by:

$$ w = 3.5 \sqrt{\frac{\sum (d_i - h)^2}{L - 1}} = 3.13 $$

Ability estimates for each raw score are given by:

$$ b_f = -2.50 + 3.13(f - 0.5) + \ln \frac{1 - e^{-3.13f}}{1 - e^{-3.13(1-f)}} $$

with standard error

$$ s_f = \left[ \frac{3.13}{15} \cdot \frac{1 - e^{-3.13}}{\left(1 - e^{-3.13f}\right)\left(1 - e^{-3.13(1-f)}\right)} \right]^{1/2} $$

Table 10 presents the ability estimates and standard errors for various
scores on this review test.
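The characteristics of review test 1 can be recomputed from the fifteen selected item difficulties. The sketch below takes those difficulties from Table 9 and rounds the height and width to two decimals before applying the UFORM expressions, as the hand calculation does; the variable and function names are illustrative.

```python
import math

# Selected item difficulties for review test 1 (Table 9).
selected = [-3.85, -3.66, -3.48, -3.33, -3.06, -2.94, -2.82, -2.52,
            -2.27, -2.12, -1.91, -1.72, -1.49, -1.33, -1.00]

length = len(selected)
h = round(sum(selected) / length, 2)
spread = sum((d - h) ** 2 for d in selected) / (length - 1)
w = round(3.5 * math.sqrt(spread), 2)

def review_estimate(r):
    """UFORM ability estimate and standard error for raw score r on this test."""
    f = r / length
    a = 1 - math.exp(-w * f)
    b = 1 - math.exp(-w * (1 - f))
    c = 1 - math.exp(-w)
    return h + w * (f - 0.5) + math.log(a / b), math.sqrt((w / length) * c / (a * b))

print(f"h = {h:.2f}, w = {w:.2f}")
for r in range(1, length):
    ability, std_err = review_estimate(r)
    print(f"r = {r:2d}   b = {ability:6.2f}   s = {std_err:.2f}")
```

The standard errors are smallest for middle scores and grow towards both ends, which is why scores near 0 or 15 point to an easier or harder review test.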

If this review test is being used as an instrument to ascertain the
position of a student relevant to the objectives after an extended period
of instruction related to the objectives, then the student's score can
suggest which objectives have been mastered provided that the average
difficulty of each objective is known. Further we can argue that:

(i) r < 3 indicates that a less difficult review test is necessary;

(ii) 3 ≤ r ≤ 12 indicates that the student's ability is in the range of the
objectives;

(iii) r > 12 suggests a more difficult review test relating to other
objectives because the student's ability is beyond the range of these
objectives.

Table 9 Item Selection for Review Test 1

  Item      Desired       Selected                    Item number
  number    difficulty    difficulty    Deviation     and objective
     1        -3.9          -3.85         -0.05         (9, 31)
     2        -3.7          -3.66         -0.04         (2, 31)
     3        -3.5          -3.48         -0.02         (14, 31)
     4        -3.3          -3.33          0.03         (8, 31)
     5        -3.1          -3.06         -0.04         (2, 33)
     6        -2.9          -2.94          0.04         (5, 33)
     7        -2.7          -2.82          0.12         (3, 31)
     8        -2.5          -2.52          0.02         (7, 33)
     9        -2.3          -2.27         -0.03         (8, 33)
    10        -2.1          -2.12          0.02         (7, 32)
    11        -1.9          -1.91          0.01         (11, 33)
    12        -1.7          -1.72          0.02         (17, 32)
    13        -1.5          -1.49         -0.01         (6, 33)
    14        -1.3          -1.33          0.03         (13, 32)
    15        -1.1          -1.00         -0.10         (18, 32)

Table 10 Ability Estimates and Standard Errors for Raw Score Values
on Review Test 1

  Raw score    Relative score
      r           f = r/L          b_f        s_f
      1            0.067          -5.47       1.06
      2            0.133          -4.65       0.79
      3            0.200          -4.12       0.68
      4            0.267          -3.69       0.63
      5            0.333          -3.32       0.59
      6            0.400          -2.98       0.57
      7            0.466          -2.66       0.57
      8            0.533          -2.34       0.57
      9            0.600          -2.02       0.57
     10            0.667          -1.68       0.59
     11            0.733          -1.31       0.63
     12            0.800          -0.88       0.68
     13            0.866          -0.35       0.79
     14            0.933          +0.47       1.06

USING AN ITEM BANK OF CALIBRATED ITEMS

Once an item bank or pool is established, we can make the questions and
associated data available to teachers. However, the majority of teachers
in Australian schools do not have ready access to computers and it is
necessary to provide simplified procedures which will not need
sophisticated computing facilities. By contrast, hand-held calculators are
widespread and small programmable calculators are becoming more
common. Our experience in lecturing to teacher trainees and graduate
teachers indicates that worksheets can assist teachers to collate
information and to produce relevant statistics. Accordingly we sought to
develop worksheets which would enable teachers to use information from
an item bank to construct tests with known characteristics, to check that
their group of students perform on such tests in a manner consistent with
the performance of the reference group used to set up the item bank, and to
scale their own items to the continuum underlying the item bank.

Where the items in an item bank have been scaled on a single
continuum, a teacher may construct either a test for relatively precise
measurement in a particular part of the continuum or a broader test
which will provide estimates of the range of achievement in that
classroom. If desired, both types of test could be used. The results on the
wide range test could suggest the types of item which might be tested in
more detail.

The UFORM procedure referred to earlier in this paper may be used to
estimate the ability associated with each raw score on a test using the item
bank data for each of the items in the bank. (Ability estimates cannot be
made for zero or a perfect score.) This procedure assumes that the items
in the bank are calibrated on a single continuum and it is recommended
that items are uniformly spaced in difficulty for the intended target
population (Wright and Douglas, 1975).

The difficulty of the items selected for the bank is averaged to estimate 
the test height, and the variance of the item difficulties is used to estimate 
the width of the test. The estimated ability is 

$b_f = h + w(f - 0.5) + \ln(A/B)$

where h is the mean difficulty of the items,

w is the estimated test width,

f is the proportion of the items correct,

A is $1 - \exp(-wf)$, and

B is $1 - \exp[-w(1 - f)]$.

The associated standard error is

$s_f = [(w/L)(C/AB)]^{1/2}$

where L is the length of the test and
C is $1 - \exp(-w)$.

The worksheet for this task (see Appendix I) is used with a calculator
having 'e^x' and 'ln' function keys.

Table 11 shows the results obtained using the worksheet for a test of
five items from an item bank with difficulties -0.560, -0.174, +0.012,
+0.197, and +0.573.

If required, the ability estimates may be recalculated for the six-item
test which results when an item of difficulty +0.975 is added to the five-
item test. Table 12 shows the corresponding results.
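
The worksheet arithmetic can also be sketched in a few lines of Python. This is only an illustration of the two formulas above, not the worksheet itself. The function name `uform_table` is ours, and the rule used for the test width (3.5 times the sample standard deviation of the item difficulties) is an assumption: the text says only that the width is estimated from the variance of the difficulties. With that assumption the sketch appears to reproduce the five-item and six-item values quoted in this section.

```python
import math
import statistics


def uform_table(difficulties, width_factor=3.5):
    """Ability estimate and standard error for each raw score 1..L-1.

    width_factor is an assumed constant relating test width to the
    sample standard deviation of the item difficulties.
    """
    L = len(difficulties)
    h = statistics.mean(difficulties)                   # test height
    w = width_factor * statistics.stdev(difficulties)   # test width (assumed rule)
    C = 1 - math.exp(-w)
    rows = []
    # Zero and perfect scores are skipped: no finite estimate exists for them.
    for r in range(1, L):
        f = r / L                                       # proportion correct
        A = 1 - math.exp(-w * f)
        B = 1 - math.exp(-w * (1 - f))
        b = h + w * (f - 0.5) + math.log(A / B)         # ability estimate
        s = math.sqrt((w / L) * C / (A * B))            # standard error
        rows.append((r, f, b, s))
    return rows


five_items = [-0.560, -0.174, 0.012, 0.197, 0.573]
for r, f, b, s in uform_table(five_items):
    print(f"{r}  {f:.2f}  {b:+.3f}  {s:.3f}")
```

Calling `uform_table` again with the item of difficulty +0.975 appended gives the six-item case.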

Instead of using the worksheet, we can obtain this table of ability 
estimates and associated standard errors from convenient tables 
presented in Wright and Stone (1979, p. 146). For the example shown 
above, the estimates from the Wright and Stone tables are compared 
with the worksheet calculations (all correct to 2 decimal places) in Table
13. 

These estimates from item bank data can be compared with actual
observations to see whether the predictions provide useful information.
However, such a comparison requires a procedure to calibrate items with
another group as well as a procedure to check whether both the original
group and the new group react to the items in a consistent fashion. Both
types of procedure are now described.

Table 11 Person Ability Estimates and Associated Standard Errors
for Various Scores on a Five-item Test

Raw score   Proportion correct   Ability estimate   Standard error
    r               f                  b_f                s_f
    1              0.2               -1.430              1.134
    2              0.4               -0.414              0.932
    3              0.6                0.433              0.932
    4              0.8                1.450              1.134

Table 12 Person Ability Estimates and Associated Standard Errors
for Various Scores on a Six-item Test

Raw score   Proportion correct   Ability estimate   Standard error
    r               f                  b_f                s_f
    1              0.17              -1.538              1.118
    2              0.33              -0.572              0.894
    3              0.50               0.170              0.846
    4              0.67               0.913              0.894
    5              0.83               1.879              1.118

Table 13 Ability Estimates and Associated Standard Errors for
Various Scores

          Worksheet calculations     Wright and Stone^a
Score        b_f        s_f            b_f        s_f
  1         -1.43       1.13          -1.49       1.16
  2         -0.41       0.93          -0.39       0.94
  3          0.43       0.93           0.41       0.94
  4          1.45       1.13           1.51       1.16

^a Wright and Stone, 1979, tables on p. 146.

CALIBRATION OF ITEMS USING PROX
Wright and Stone (1979) describe a procedure called PROX which pro-
vides a Rasch calibration of test items and 'approximates the results ob-
tained by more elaborate and hence more accurate procedures extremely
well' (Wright and Stone, 1979, p. 28).

This procedure assumes that item difficulties and person abilities are
more or less normally distributed. Item difficulty is estimated with
reasonable accuracy (when compared with more sophisticated computing
procedures) where both person abilities and item difficulties are more or
less symmetrically distributed around one mode, and the location and
spread of person abilities and the item difficulties are similar.

The procedure requires the responses to be listed in a student-by-item
matrix and the calculations are carried out on marginal totals of correct
and incorrect counts for both items and students. (If there are any
students with perfect or zero scores and any items with perfect or zero
success rates, these are deleted from the matrix before the calculations.)
This listing of student responses may present the classroom teachers with
a sizable clerical chore. However, a class analysis chart after the style of
the CATIM material (ACER, 1976; 1979) allows the teacher to avoid this
clerical work by transferring the actual student responses to a chart.

Using PROX it is possible to calibrate items using a hand calculator
and paper and pencil. Wright and Stone point out that the PROX pro-
cedure has an application in the classroom but, if classroom teachers are
to use such a procedure, it is our view that further assistance needs to be
provided. This assistance is provided in the form of a worksheet (see Ap-
pendix II) which uses marginal totals from a class analysis chart, and an
example of its use is presented in Table 14. When the procedure is ap-
plied to these data, the results shown in Tables 15 and 16 are obtained.
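
For readers with access to a computer, the PROX arithmetic can be sketched as follows. The formulas are the usual PROX expressions as we understand them from Wright and Stone (1979); they are not spelled out in this paper, so the expansion factors with the constants 2.89 and 8.35, and the function name `prox`, should be read as assumptions of the sketch. Only the marginal totals are needed: the six item scores and ten person scores used here are those implied by Tables 14 to 16.

```python
import math
import statistics


def prox(item_scores, person_scores):
    """PROX calibration from marginal totals.

    Items or persons with zero or perfect scores must be removed first.
    Returns (difficulty, standard error) per item and
    (measure, standard error) per person.
    """
    N = len(person_scores)   # number of persons
    L = len(item_scores)     # number of items
    x = [math.log((N - s) / s) for s in item_scores]     # initial item logits
    y = [math.log(r / (L - r)) for r in person_scores]   # initial person logits
    x_mean = statistics.mean(x)
    U = statistics.variance(x)   # sample variance of item logits
    V = statistics.variance(y)   # sample variance of person logits
    denom = 1 - U * V / 8.35
    X = math.sqrt((1 + V / 2.89) / denom)   # item expansion factor
    Y = math.sqrt((1 + U / 2.89) / denom)   # person expansion factor
    items = [(X * (xi - x_mean), X * math.sqrt(N / (s * (N - s))))
             for xi, s in zip(x, item_scores)]
    persons = [(Y * yr, Y * math.sqrt(L / (r * (L - r))))
               for yr, r in zip(y, person_scores)]
    return items, persons


items, persons = prox([8, 5, 5, 5, 1, 2], [5, 4, 4, 3, 3, 2, 2, 1, 1, 1])
for d, se in items:
    print(f"{d:+.3f}  {se:.3f}")
```

With these marginal totals the sketch appears to agree with the corrected calibrations and standard errors in Tables 15 and 16 to three decimal places.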



Table 14 Data Matrix for Ten Persons and Six Items

Item                    Person number                     Item
number     1   2   3   4   5   6   7   8   9  10         score
  1        1   1   1   1   1   1   1   0   1   0            8
  2        1   1   1   0   1   0   0   0   0   1            5
  3        1   1   1   0   0   1   1   0   0   0            5
  4        .   .   .   1   1   0   0   1   0   0            5
  5        .   .   .   0   0   0   0   0   0   0            1
  6        .   .   .   1   0   0   0   0   0   0            2
Person
score      5   4   4   3   3   2   2   1   1   1       N = 10
                                                        L = 6

Table 15 Item Calibrations Obtained Using PROX

Item      Item                        Initial       Corrected     Standard
number    score s   ln[(N - s)/s]     calibration   calibration   error
  1         8          -1.386           -1.752        -2.399       1.082
  2         5           0.000           -0.366        -0.501       0.866
  3         5           0.000           -0.366        -0.501       0.866
  4         5           0.000           -0.366        -0.501       0.866
  5         1           2.197            1.831         2.506       1.443
  6         2           1.386            1.020         1.396       1.082

Mean = 0.366
U = variance = 1.573

Table 16 Person Measures Obtained Using PROX

Person    Person    Initial measure    Corrected    Standard
number    score r   ln[r/(L - r)]      measure      error
  1         5           1.609            2.287       1.557
  2         4           0.693            0.985       1.231
  3         4           0.693            0.985       1.231
  4         3           0.000            0.000       1.160
  5         3           0.000            0.000       1.160
  6         2          -0.693           -0.985       1.231
  7         2          -0.693           -0.985       1.231
  8         1          -1.609           -2.287       1.557
  9         1          -1.609           -2.287       1.557
 10         1          -1.609           -2.287       1.557

Mean = -0.322
V = variance = 1.250

The procedure for calibrating teacher-made items on to the same con-
tinuum defined by an item bank requires that items constructed by the
teacher be administered to a group of students together with a set of
items from the item bank. The items from the item bank constitute the
link, and the quality of this link can be investigated using the procedures
advocated by Wright and Stone (1979, pp. 96-116). The quality control of
the link enables the original results obtained by the reference group
(which provided the data on the banked items) to be compared with the
results obtained from the sampled items test. We would expect that the
observed difficulties for the link items would differ from the difficulties of
those items obtained for the reference group to the extent that the group
of persons being tested is more or less able than the reference group.
After adjusting for the group difference in ability, the remaining
discrepancies for each item are expected to have a mean of zero (Wright
and Stone, 1979, p. 114). If both groups react to the test items in a con-
sistent way, then all items are adjusted to the original scale defined by the
item bank reference group data.

The worksheet (see Appendix III) devised for this task uses the data
from the item bank and the data obtained from the group of students for
the link items. (The latter data could be obtained using Worksheet 2 as
described above.) The difference in ability between the reference group
and the group of students is estimated as the mean of the differences in
difficulty for each item. The magnitude of the remaining discrepancy is
considered for each item separately as well as for the set of items from
the bank.

If it is decided that the discrepancies are small enough to be ignored,
then the calibrations of the teacher-made items are adjusted to the same
extent. If the discrepancies are too large to be ignored, then it may be
necessary to conclude that the group of students reacts to the questions in
the item bank in a different way from the group which provided the
original item bank data. Further, the differences between the two groups
cannot be accounted for by a difference in ability.

In order to illustrate this procedure, results from a calibration of a
55-item mathematics test were used to construct two seven-item tests.
Results for these tests on another sample from the same population as
that sampled for the calibration provided 'observed difficulties'. The
difficulties for the reference group and the sample group are summarized
for two collections of items in Table 17. The table lists the approximate
chi-squared estimates associated with residual discrepancies for each
item. The discrepancies are related to the particular group of seven items
for which the calibration was performed. Test A illustrates an acceptable
linking collection of items since the chi-squared estimate of 7.87 is less
than the critical value of 14.07. Test B items do not appear to be consis-
tent for both the reference group and the sample group (chi-squared
estimate is 33.88) and therefore Test B is not a satisfactory link. In Test
A, item 24 appears to be a poor item for the link; in Test B, items 13, 49,
and 54 contribute most to the poor quality of the link.
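
The first part of this check, the group shift and the residual discrepancies, can be sketched in Python as below. The sketch is illustrative only. The function name `link_adjustment` is ours, the sign convention (item bank difficulty minus observed difficulty) is an assumption that appears consistent with the residuals listed for Test A, and the item bank difficulty of item 24 is taken as -0.314, the sign being inferred from its residual. The teacher-made item with an observed difficulty of 0.250 is hypothetical.

```python
def link_adjustment(bank, observed):
    """Return the mean shift between two calibrations of the link items
    and the residual discrepancy left for each item after the shift."""
    diffs = [b - o for b, o in zip(bank, observed)]
    shift = sum(diffs) / len(diffs)           # estimated group difference
    residuals = [d - shift for d in diffs]    # should scatter around zero
    return shift, residuals


# Test A link items 7, 16, 24, 39, 46, 53, 55
bank_a = [-0.928, -1.854, -0.314, 1.262, 0.316, 0.912, 1.873]
observed_a = [-1.224, -2.106, -0.953, 1.127, 0.326, 0.912, 1.918]

shift, residuals = link_adjustment(bank_a, observed_a)
print(round(shift, 3), [round(r, 3) for r in residuals])

# A hypothetical teacher-made item calibrated at 0.250 on the class scale
# is moved to the item bank scale by the same shift.
teacher_item_on_bank_scale = 0.250 + shift
```

The chi-squared estimates additionally depend on the calibration sample size, which is not stated in this section, so they are left out of the sketch.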

Table 17 Observed Difficulties for Two Seven-item Tests

          Item      Item bank     Observed      Residual       Chi-squared
Test      number    difficulty    difficulty    discrepancy    estimate
Test A       7       -0.928        -1.224          0.115          0.297
            16       -1.854        -2.106          0.071          0.113
            24       -0.314        -0.953          0.458          4.711
            39        1.262         1.127         -0.046          0.048
            46        0.316         0.326         -0.191          0.819
            53        0.912         0.912         -0.181          0.736
            55        1.873         1.918         -0.226          1.147
Test B      11       -0.584        -0.857          0.356          2.624
            13       -2.122        -2.791          0.752         11.711
            19       -0.954        -1.131          0.260          1.400
            38       -0.185        -0.209          0.107          0.237
            45        0.482         0.754         -0.189          0.740
            49        1.002         1.696         -0.611          7.731
            54        1.779         2.537         -0.675          9.435

This section set out to describe the procedure for calibrating teacher-
made items for a particular line of inquiry onto a calibrated collection of
items along the same line of inquiry. After finding suitable link items, the
teacher-made items are calibrated onto the existing scale using a simple
translation calculated from the difference between link item difficulties
on the reference scale and the scale obtained on the link test. In develop-
ing the computational steps for this procedure, it has become clear that
we must investigate further those characteristics of an item or groups of
items which could indicate to us suitability for the linking process.
Another area requiring further investigation comes immediately to mind
when we consider, if two seven-item tests produce such different link
qualities, whether they also produce different ability estimates for the
group to which the items are exposed. The answer to this investigation
may have implications for item banking and although it would be the
substance of another paper a preliminary analysis is reported in Table 18.
Table 18 shows the estimated abilities for each raw score and the cor-
responding observed abilities for the second sample on both the
calibrated and observed data. The differences between ability estimates
from the four sources are well within acceptable tolerances when we
examine the respective standard errors. When this area is investigated
completely we would expect to develop instructions for classroom use of
established calibrations of items.

Table 18 Ability Estimates and Associated Standard Errors for
Two Seven-item Tests

                     Estimated from
          Raw        calibration data        Observed
          score       b_f       s_f        b_f       s_f
Test A      1        -2.16      1.17      -1.99      1.18
            2        -1.07      0.96      -1.06      0.98
            3        -0.22      0.90      -0.32      0.92
            4         0.58      0.90       0.37      0.91
            5         1.44      0.96       1.07      0.95
            6         2.53      1.17       1.96      1.16
Test B      1        -2.43      1.17      -2.19      1.24
            2        -1.34      0.96      -1.17      1.02
            3        -0.48      0.90      -0.37      0.95
            4         0.32      0.90       0.38      0.96
            5         1.17      0.96       1.19      1.02
            6         2.26      1.17       2.20      1.22



It has been shown that a pool of items which has been calibrated onto an
ability scale using Rasch analysis can be used to assist classroom assess-
ment in two ways. The first application involves the production of pro-
gress and review tests by a test development group. The users of these
tests do not necessarily have to understand the underlying theoretical
structure of the tests, but they must know the simple rules to use in inter-
preting raw scores of students on the tests.

The second application involves the user with decision-making
associated directly with the pool of items. Although many easy-to-follow
worksheets were developed for calibration of items, estimation of
abilities, and use of the established item pool, it is anticipated that the
user would need to be aware of the assumptions and concepts of Rasch
measurement if the sheets were to be used. In either case the use of Rasch
analysis has been directed towards the provision and development of ob-
jective measuring instruments in which the teacher has a great deal of
flexibility in choosing the individual questions that match the teaching in-
tention.

APPENDIX I
Calculation of Ability Estimates from Item Bank Data

Worksheet 2: Calibration of a Test Using PROX (Part 2: Persons)

Worksheet 2: Calibration of a Test Using PROX (Part 3: Calculations)

From Part 1: Mean = Column (4) Total / No. of Items

APPENDIX III

Worksheet 3: Comparing a Sample Group with a Reference Group

The Use of the Rasch Latent Trait
Measurement Model in the Equating
of Scholastic Aptitude Tests

George Morgan 

This paper reports the results of an exploratory investigation which at- 
tempted to assess the capabilities of Rasch's Simple Logistic Model in the 
calibration and equating of final and trial forms of the Australian
Scholastic Aptitude Test (ASAT). The investigation had two main aims:
(i) to determine to what extent the items in the final and trial forms of the
ASAT can be successfully fitted to latent variables of general scholastic
aptitude determined by calibrations of items in whole tests and various
sub-tests, based primarily on content, and (ii) to determine whether
equatings of ASAT forms can be undertaken successfully at the whole
test or sub-test levels. 

The items in the ASAT† are grouped into units, each unit being con-
cerned with a particular theme. A unit begins with stimulus material,
presented in a variety of forms, drawn from the four broad subject (con- 
tent) areas of humanities, social science, mathematics, and science, and 
is followed by a group of binary-scored, multiple-choice items related to 
the stimulus material. The items in the ASAT are designed to measure a 
wide range of abilities and skills, such as those concerned with the inter- 
pretation and comprehension of scholastic materials, that are relevant to 
academic courses at the Year 12 level of secondary education and at the 
tertiary level. When the tests are constructed, care is taken to avoid using 
materials directly related to Year 12 syllabuses. 

The ASAT is an omnibus test of scholastic aptitude but it does not
relate to any particular theoretical model. Broadly the test's structure is
determined to a large extent by the pool of abilities and skills underlying
the particular items which happen to be incorporated in the test, when a
form of the test is constructed.

† The ASAT is a secure test, but a booklet containing a sample collection
of items may be obtained for inspection.

Although a considerable literature exists on the ASAT (Lees, 1978), 
for the most part it is concerned with the test's power to predict success in 
tertiary courses and its use as a scaling instrument. 

It appears that the earliest investigation into the psychometric proper-
ties of the ASAT was undertaken by McGaw and Greddon (1973). A few
years later a comprehensive study of the psychometric properties of the
1973 version of the test, ASAT-B, was carried out by Bell (1977) and,
more recently, Bell (1979) factor analysed the 1977 version of the test,
ASAT-F. He found that the first principal component of ASAT-F ac-
counted for 10 per cent of the test variance. These studies indicated that 
the ASAT is factorially complex, and that at a global level the test can be 
characterized by a general ability factor, and more specifically by factors 
representing quantitative and verbal abilities. 

In his study of ASAT-B, Bell (1977) analysed the test using traditional 
item analysis procedures as well as those based on the Rasch Simple 
Logistic Model. He found that about two-thirds of the ASAT-B items 
conformed to the Rasch model. More recently, Bond (1978) applied the
Rasch Simple Logistic Model in the multiplicative binomial framework
in an analysis of ASAT-F. He suggested that Rasch measurement of the
ASAT should be based on the units rather than on the items, because the
items tend to lose the part played by the stimulus material of each unit.
Eleven of the eighteen units in the test were calibrated by him to a
unidimensional latent trait of general ability.

With a factorially complex test like the ASAT, it is not clear which 
group of items in a form should be calibrated together in order to permit 
satisfactory equatings between forms. Obviously, basing the equatings 
on the estimates from a Rasch multiplicative binomial analysis of the 
units in forms is impracticable, because the number of link units required 
for an adequate analysis would require the construction of inordinately 
long tests. 

Even so, if equatings between ASAT forms are to be based on the in-
dividual items in the test, it is not clear which items should form the
links. Reckase (1979) showed that for factorially complex tests, the
Rasch Simple Logistic Model estimates the sum of the factors when there
is more than one independent factor, and estimates the first dominant
factor when it exists. In the latter situation, he found that stable item
calibrations can be obtained even if the first factor accounts for less than
10 per cent of the variance. These findings suggest that stable ASAT
equatings might be obtained at the whole test level and, if so, the various
test forms could be equated within the existing framework of test
development and application.

Currently the test's main function is to act as a uni-valued variable in
the scaling of Year 12 public examination results or teacher assessments
of student achievements where such examinations do not exist. If the
Rasch Simple Logistic Model could be used to equate the ASAT forms at
the whole test level, equated scores could be derived from the different
forms of the ASAT and then be applied in the scaling process, thus
bringing the effects of the scaling to a common base. Otherwise equatings
may need to be undertaken at a sub-test level determined by criteria like
content homogeneity or pure factor structure.



The ASAT program of test development has evolved over a number of 
years and is now well established. It provides a somewhat routine 
schedule to be followed in the construction and trial testing of each form 
of the test. Thus, beginning with a pool of units in each of the subject 
areas of mathematics, science, humanities, and social science, the units
are processed and, from these, units are selected for inclusion in the trial
forms. The trial forms are then administered to a sample of Year 12
students. Subsequently, using classical test theory principles of test con-
struction in conjunction with expert considerations about the content
and kinds of abilities and skills measured by the items, units are selected
to make up the final form of the test.

In formulating the course of the investigation, the intention was to 
allow work to proceed within the framework of the existing program of 
test development outlined above. This seemed a profitable course to
follow, since preliminary calibrations of the items in the whole test and
some sub-tests of the ASAT-G showed that appreciable numbers of the 
available pool of items were satisfactorily fitted to the Rasch latent 
ability continuum associated with the whole test and the sub-tests based 
on the four broad subject areas. 

An alternative course would have involved the creation of Rasch-like 
forms from the outset, perhaps aiming to have forms of equal length and 
containing sufficient numbers of link items to ensure satisfactory 
equatings of the forms. However, such an approach would have entailed 
going beyond the current practice of test development as outlined in the 
test's specification (ACER, 1978), and this would need to be agreed to by 
the ASAT users. Nevertheless it seemed at the time that an exploratory 
investigation within the existing test development framework would be 
able to shed some light on the kinds of results that might be expected 
when Rasch measurement techniques are applied to the ASAT.

Whole Tests and Sub-tests

The investigation was concerned with equating two final forms, ASAT-G
and ASAT-H, through four trial forms of ASAT-H. Equatings of forms
were analysed using various combinations of the items which were in-
dependently calibrated. Table 1 provides a brief statistical description of
these forms.

Table 1 Brief Statistical Description of ASAT Forms

           Number of   Number of     Mean     Standard
Form       items       students^a    score    deviation    KR 20
ASAT-G       100         2345         64        16.0        0.93
ASAT-H       100         2422         60        14.8        0.91
V             71          246         32.2       8.2        0.79
W             72          252         29.7       8.2        0.78
Y             72          249         34.6       8.6        0.80
Z             72          248         31.4       8.6        0.80

^a Data for ASAT-G and ASAT-H came from students in the Australian Capital
Territory; data for forms V, W, Y, and Z from students in Tasmania and
South Australia who took part in the trial testing of ASAT-H.

Table 2 shows the distribution of items in each form across the four
subject areas. The items in each of the six forms were grouped into a
whole test, consisting of all the items in the form, and eight sub-tests
which were: Mathematics/Science, Humanities/Social Science,
Humanities, Social Science, Mathematics, Science, Quantitative, and
Verbal. Except for the Quantitative and Verbal sub-tests, each sub-test
contained all items in the relevant subject area(s) that were available.

In constructing the Quantitative and Verbal sub-tests, the following
arbitrary criteria were used. Humanities items were assigned to the Ver-
bal sub-test and mathematics items to the Quantitative sub-test. Of the
science and social science items, if the point-biserial correlation between
an item and the Mathematics/Science sub-test was greater by 0.03 than
the point-biserial correlation between the item and the Humanities/
Social Science sub-test, the item was assigned to the Quantitative sub-
test. Conversely items were assigned to the Verbal sub-test. In cases
where a decision could not be made on the basis of correlations, items
were classified on the basis of their face validity. A few items could not
be classified using these criteria and hence were omitted from these sub-
tests.

Table 2 Distribution of Items in ASAT Forms According to Subject
Area
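
The assignment rule can be stated compactly in Python. This is only a sketch of the rule as described above. The function name `assign_subtest`, the treatment of a difference of exactly 0.03 as sufficient, and the use of `None` for items left to a face-validity judgement are assumptions of the sketch.

```python
def assign_subtest(area, r_math_sci=None, r_hum_soc=None, margin=0.03):
    """Assign an item to the Quantitative or Verbal sub-test.

    area        subject area of the item
    r_math_sci  point-biserial correlation with Mathematics/Science
    r_hum_soc   point-biserial correlation with Humanities/Social Science
    Returns None when the correlations do not decide the matter.
    """
    if area == "humanities":
        return "Verbal"
    if area == "mathematics":
        return "Quantitative"
    # Science and social science items are decided by the correlations.
    if r_math_sci - r_hum_soc >= margin:
        return "Quantitative"
    if r_hum_soc - r_math_sci >= margin:
        return "Verbal"
    return None   # classify on face validity, or omit from both sub-tests


print(assign_subtest("science", 0.40, 0.30))
print(assign_subtest("social science", 0.20, 0.35))
print(assign_subtest("science", 0.30, 0.31))
```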

Link Structure 

The following list gives the length of each form, the subject area of the
link items in the form, and the name (in parentheses) of the form(s) to
which it was linked. Forms V, W, Y, and Z are the trial forms of
ASAT-H used in this study.

Form V    71 items    16 humanities, 10 science, and 5 mathematics items (Form H)
                      10 humanities items (Form Z)
Form W    72 items    14 humanities, 6 social science, and 9 science items (Form H)
                      5 science and 5 mathematics items (Form Y)
Form Y    72 items    10 social science, 6 science, and 4 mathematics items (Form G)
                      5 science and 3 mathematics items (Form H)
                      10 humanities items (Form V)
Form Z    72 items    10 humanities, 5 science, and 5 mathematics items (Form G)
                      4 science and 6 mathematics items (Form G)
                      5 science and 5 mathematics items (Form W)
Form G    100 items   10 social science, 6 science, and 4 mathematics items (Form Y)
(ASAT-G)              10 humanities, 5 science, and 5 mathematics items (Form Z)
Form H    100 items   16 humanities, 10 science, and 5 mathematics items (Form V)
(ASAT-H)              14 humanities, 6 social science, and 9 science items (Form W)
                      5 science and 3 mathematics items (Form Y)
                      5 science and 5 mathematics items (Form Z)

This arrangement of link units among the forms allowed the investiga-
tion of form equating at the whole test level and at the Mathematics,
Science, Quantitative, and Verbal sub-test levels. The link structure is
shown schematically in Figure 1.

Whole Test Links

Figure 1 Link Structure for ASAT Equatings with Translation Con-

In order to preserve the unit structure of the ASAT when equating
forms, entire units were initially selected to provide the links between
forms rather than individual items. In all cases the link units had average
item facilities and average item point-biserial discrimination values
which were neither too large nor too small.

Given the amount of material to be trial tested, the samples of students 
available for analysing the total test data, and the availability of a 
relatively short testing time (of I hours), it was not possible to include 
more link items within the trial forms. 

Calibration Samples 

For the ASAT-G and ASAT-H, the calibration samples were two 
separate groups of randomly selected Year 12 students who sat the tests 
in the Australian Capital Territory in September of 1978 and 1979, 
respectively; for Forms V, W, Y, and Z the calibration samples were 
Year 12 students who participated in the trial testing of the ASAT-H in 
Tasmania and South Australia in March 1979.

Computer Program for Rasch Measurement

The program used was CALFIT-3, a computer program adapted by R. 
Wines and D. Keuneniann from one designed by B. Wright and R. Mead 
(Cornish, 1976). 

CALFIT-3 estimates item difficulties and person abilities of the Rasch 
Simple Logistic Model using the corrected unconditional maximum 
likelihood statistical procedure (Wright and Panchapakesan, 1969). In 
addition it estimates how well an item conforms to the Rasch model. 
CALFIT-3 also estimates a probability of sub-test fit which indicates how
well a group of items conforms to the model, as items are accumulated 
one by one into a sub-test, starting with the best fitted item. 

The program performs its calculations in two cycles. First it calibrates 
all the items, omitting items which everyone answers correctly or 
everyone answers incorrectly, and then estimates person abilities after 
deleting persons with zero or possible maximum raw score. In the next 
cycle it gathers the best-fitted items, according to a probability of sub-test
fit cut-off provided by the user, recalibrates this group of items, and pro-
duces revised estimates of person abilities.

Method of Equating or Linking

This was the Rasch common item method of equating tests which is
described in detail by Wright (1977) and Wright and Stone (1979).

Suppose Test a and Test /) share a common set of A' items, called the 
link items, in the Rasch common item method of equating two tests, the 
scale of the latent variab'e of one of the tests, say Test is adjusted to 
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the scaie ot* the latent variable of the other test, Test a, using the 
dilVercnce in average estimated diflicuhies of the common items from the 
two separate calibrations to translate the difficultv estimates of Test /; to 
the scale of Test a. Providing the common items and the other items in 
both tests conform to the Rasch model, and are calibrated to the same 
latent variable, thi,s method yields a pool of calibrated items whose 
esiimaied difficulties are on a common scale. 

A summary of the main elements of this method, proposed by Wright 
(1977) and Wright and Stone (1979) follows, 

1 Begin by separately calibrating the items in T(?st a and Test /;, which 
give two independent sets of estimated item difficulties for the link items, 
Ix'i and r/,/, represent the estimated item difficulties of the /th item in 
ihc link, in Test a and Test /; respectively, 

2 Calculate the translation constant which clTectively translates all 
item difficulty estimates from the calibration of Test /; to the calibration 
scale of Test a, using the formula 

U= i: UL'dJ/K. 

This translation constant is the difVerence in average estimated item 
difficulties of the common items in the two calibrations. The standard 
error of the estimated translation constant, SE(^J is approximately 
3.5 '(/VA) where N is the calibration sample size of the link items and A' 
is the number of items in the link. Unfortunately this expression for the 
standard error of the translation constant applies to the situation where 
the link items are calibrated in a separate test, taken by N examinees. In 
this investigation the link items were placed in two separately calibrated 
forms and hence the formula did not apply because the calibration 
sample sizes dilfcred for the two forms. However, so as to obtain an ap- 
proximation tor this error the value of N in the expression was arbitrarily 
taken to be the smaller of the two calibration sample sizes, 

3 The validity of the link between Test a and Test /; may be tested 
usitig the statistic 

12 (A 1) . 

whicli i^ distributed approximately as a chi-square with A' degrees of 
freedom. Alternativ ely the validity of the link may be tested by determin- 
ing the mean and standard deviation of the standardized residuals 

•V/) 

where 5,> (,S7:(r/.J' + .V£VAh)') , to see if .*^ese estimate the expected 
mean equal to zero and expected standard deviation equal to I, 
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4 The validity of an item in the link may be tested using the statistic 

which is distributed approximately as chi-square with one degree of 
freedom. 

5 Alternatively the validity of the link and the items in the link may be 
ascertained visually by plotting the difficulty estimates of the
common items from the two calibrations, and observing the extent to 
which the points are scattered about the line of perfect agreement. 

6 If three or more tests are linked so as to form a closed loop, the
consistency of the links may be tested by summing the corresponding
translation constants around the loop and examining whether this sum
estimates zero within one or two standard errors of this sum. For
example, if Test a, Test b, and Test c form a loop, then

$$t_{ab} + t_{bc} + t_{ca} \approx 0.$$

The standard error of the sum may be estimated using the expression

3.5/ ^ . > . M , 

where, in this study, $N_{ab}$, etc. were taken to be the smaller of the two
calibration sample sizes, and $K_{ab}$, etc. are the number of common items
in the links.
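The link calculations in steps 2 to 4 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names are hypothetical, the item difficulties and standard errors are made-up values rather than ASAT data, and N is taken as a single calibration sample size as in the approximation described above.

```python
import math

def translation_constant(d_a, d_b):
    """Mean of d_ia - d_ib over the K link items (step 2)."""
    return sum(a - b for a, b in zip(d_a, d_b)) / len(d_a)

def se_translation(n, k):
    """Approximate standard error of the translation constant, 3.5 / sqrt(N*K)."""
    return 3.5 / math.sqrt(n * k)

def link_fit(d_a, d_b, n):
    """Return (link statistic on K df, per-item statistics on 1 df) for steps 3 and 4."""
    k = len(d_a)
    t_ab = translation_constant(d_a, d_b)
    scale = (n / 12.0) * (k / (k - 1.0))
    item_chi = [(a - b - t_ab) ** 2 * scale for a, b in zip(d_a, d_b)]
    return sum(item_chi), item_chi

def standardized_residuals(d_a, d_b, se_a, se_b):
    """Residuals (d_ia - d_ib - t_ab) / S_i with S_i = sqrt(SE_a^2 + SE_b^2)."""
    t_ab = translation_constant(d_a, d_b)
    return [(a - b - t_ab) / math.sqrt(sa ** 2 + sb ** 2)
            for a, b, sa, sb in zip(d_a, d_b, se_a, se_b)]

# Hypothetical link of K = 4 items calibrated on N = 240 examinees.
d_a = [0.5, 1.0, -0.5, 0.2]
d_b = [0.3, 0.7, -0.6, 0.0]
se_a = [0.15, 0.16, 0.14, 0.14]
se_b = [0.14, 0.15, 0.14, 0.15]

t_ab = translation_constant(d_a, d_b)
link_chi, item_chi = link_fit(d_a, d_b, n=240)
z = standardized_residuals(d_a, d_b, se_a, se_b)
print("t_ab =", round(t_ab, 3), " SE =", round(se_translation(240, len(d_a)), 3))
print("link chi-square =", round(link_chi, 3))
print("standardized residuals =", [round(v, 2) for v in z])
```

The link statistic would be referred to a chi-square distribution with K degrees of freedom, and each per-item value to one degree of freedom; the mean and standard deviation of the residuals would be compared with 0 and 1.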

RESULTS AND DISCUSSION 
Only the calibration results based on the whole test, and the
Mathematics, Science, Quantitative, and Verbal sub-tests are presented
here. The results for the other sub-tests were similar, and have
consequently been omitted.

In all the item calibrations undertaken, the major reason for some
items not conforming to the Rasch model was the item discrimination:
the observed discriminations were either too large or too small, their
values departing markedly from the model value.
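To see why discrimination matters for fit, recall that the Rasch Simple Logistic Model gives every item the same slope, so an item whose observed curve is markedly steeper or flatter cannot be reproduced by the model. The sketch below is illustrative only: the two-parameter function and its `discrimination` argument are assumptions introduced for comparison, not the fit statistic used in this study.

```python
import math

def rasch_probability(ability, difficulty):
    """P(correct) under the Rasch model; ability and difficulty are in logits."""
    return 1.0 / (1.0 + math.exp(-(ability - difficulty)))

def two_parameter_probability(ability, difficulty, discrimination):
    """Logistic curve with a free slope; discrimination = 1 reproduces the Rasch curve."""
    return 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))

abilities = [-2.0, -1.0, 0.0, 1.0, 2.0]
for slope in (0.5, 1.0, 2.0):
    # Curves for an item of difficulty 0 logits at three hypothetical slopes.
    curve = [round(two_parameter_probability(b, 0.0, slope), 2) for b in abilities]
    print("discrimination", slope, curve)
```

Items behaving like the first or third curve would show the kind of misfit described above, with observed discriminations too small or too large relative to the model value.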

Table 3 reports the percentages of fitted items of tests and sub-tests
according to the subject area of the items. Considering the results for
whole test calibrations, and all items in the form, greater percentages of
items in the trial forms were fitted than in the final forms, ASAT-G and
ASAT-H. This result is not unexpected for it reflects the sample size
sensitivity of the chi-squared method of assessing item and sub-test fit.
The larger the calibration sample size, the more likely will small
discrepancies between the observed and estimated item characteristic
curves be found to be significant. Actually the calibration sample sizes of
ASAT-G and ASAT-H were appreciably greater than those of the trial
forms. The percentages of items fitted in the trial forms were about the
same. Moreover the ASAT-G items fitted less well as a group than the
ASAT-H items, because fewer humanities and science items in the test
conformed to the model. In terms of the subject area, the group of items
which fitted worst of all were the mathematics items. On the basis of
these results it seems, perhaps with the exception of the mathematics
items, that acceptable percentages of items were fitted from the pool of
items available in each form. Obviously, had the pools of items been
sufficiently large it would have been easy to calibrate and fit Rasch-like
items to give calibrated whole tests of any desired length.

Table 3  Percentages of Fitted Items(a) of Tests and Sub-tests According
         to the Subject Area of the Items

         Percentage fitted/Number of items in subject area

Test/                                 Social
Sub-test       Form     Humanities    science    Mathematics   Science   All

Whole          ASAT-G   [?]           [?]        53/30         70/20     62/100
test           ASAT-H   [?]           [?]        50/20         97/30     75/100
               V        [?]           [?]        [?]           [?]       [?]
               W        [?]           [?]        [?]           [?]       88/72
               Y        [?]           [?]        [?]           94/18     93/72
               Z        83/23         89/19      80/20         80/10     83/72

Mathematics    ASAT-G   -             -          63/30         -         -
sub-test       ASAT-H   -             -          [?]           -         -
               V        -             -          [?]           -         -
               W        -             -          [?]           -         -
               Y        -             -          [?]           -         -
               Z        -             -          90/20         -         -

Science        ASAT-G   -             -          -             90/20     -
sub-test       ASAT-H   -             -          -             97/30     -
               V        -             -          -             90/20     -
               W        -             -          -             100/20    -
               Y        -             -          -             89/18     -
               Z        -             -          -             [?]       -

Quantitative   ASAT-G   -             79/14      57/30         62/13     63/57
sub-test       ASAT-H   -             100/4      65/20         92/26     82/50
               V        -             -          59/17         82/11     68/28
               W        -             83/6       92/13         89/18     89/37
               Y        -             100/3      94/17         100/13    97/33
               Z        -             67/6       90/20         50/8      76/34

Verbal         ASAT-G   70/30         83/6       -             83/6      74/42
sub-test       ASAT-H   83/30         85/13      -             100/4     85/47
               V        94/34         -          -             100/9     95/43
               W        91/23         100/10     -             100/2     94/35
               Y        95/19         100/12     -             100/5     95/36
               Z        78/23         100/[?]    -             100/2     86/37

a Cut-off for probability of sub-test fit was 0.01.

The results in Table 3, for the calibrations of the items in the Quan- 
titative and Verbal sub-tests, indicate that generally slightly more items 
were fitted than in calibrations based on the whole test. However, the 
pattern in the percentages of fitted items, according to the subject area of 
the items, was not entirely consistent across the forms. In some cases a 
greater percentage of items was fitted in a subject area than was the case 
with calibrations based on the whole test, and in some cases the situation
was reversed. In the Quantitative and Verbal sub-tests there was no ob- 
vious pattern in the subject areas of the better fitted items. That is, the 
rank order of the better fitted items in both sub-tests did not show a pat- 
tern of preferences for any of the subject areas from which the items 
were drawn. 

Calibrations of items in the Mathematics and Science sub-tests in
general fitted greater percentages of the available items than did
calibrations based on the whole tests. For example, 10 per cent more
mathematics items in ASAT-G were fitted in calibrations based on the
Mathematics sub-test, and 30 per cent more were fitted in ASAT-H than
in the calibrations based on the whole tests.

A tentative generalization is that a greater percentage of ASAT items
will conform to the Rasch Simple Logistic Model, if the items in each
subject area of the ASAT are calibrated independently. Apparently the
items in each subject are more closely represented in terms of a
unidimensional latent variable, from the point of view of the Rasch
Simple Logistic Model, than are the items in the 'impure' sub-tests which
contain items from two or more different subject areas. The problem of
dimensionality is not simply a matter that deals with the subject area of
the items, but rather one of identifying those abilities and skills, forming
the latent variable, that are common to the group of items which must
explain consistent examinee performance on the test.

Table 4 presents statistics of item difficulty estimates of the fitted items
for the whole tests and sub-tests calibrated. The mean of the item
difficulty estimates is zero in each case, and fixes the origin of the
calibration scale.

At each test/sub-test calibration level, the ranges of the estimated item
difficulties, for most forms, are quite similar to each other. This, together
with the fact that the standard deviations of the item difficulty estimates
are much larger than the average standard errors of these estimates,
suggests that the items in the tests/sub-tests were sufficiently scattered
on the calibration scales to give the latent variables direction. It appears
that each form was somewhat successful in providing enough items for
each test/sub-test, providing useful yardsticks against which the abilities
and skills of examinees could be assessed in the continuum of the
relevant latent variable. However, the effective ranges of the estimated
item difficulties are not as large as were expected for an omnibus test like
the ASAT. It is not surprising that the ASAT test items measure
scholastic aptitude along a narrow range of the potential scholastic
continuum, because the test construction procedures currently in use
select items with facilities centred around 50 per cent and generally
exclude items whose facilities are below 20 per cent or greater than 80 per
cent.

Table 4  Calibration Results for Fitted Items(a) of ASAT Forms According
         to Test/Sub-test Calibrated

                                   Statistics of item difficulty estimates(b)
Test/                  Percentage                                    Average   Calibration
Sub-test       Form    of fitted   d_min    d_max    Range    SD     SE(d)     sample
                       items                                                   size

Whole          ASAT-G   62         -2.26    2.36     4.62     0.91   0.13      312
test           ASAT-H   75         -2.05    1.56     3.61     0.71   0.12      308
               V        89         -2.98    2.11     5.09     0.96   0.14      240
               W        88         -2.07    2.00     4.07     0.87   0.14      247
               Y        93         -1.64    1.83     3.47     0.86   0.14      241
               Z        83         -2.92    2.18     5.10     0.94   0.14      238

Mathematics    ASAT-G   63         -1.45    1.80     3.25     1.04   0.13      336
sub-test       ASAT-H   80         -1.22    1.75     2.97     0.85   0.14      309
               V        76         -1.53    1.33     2.86     0.98   0.19      192
               W        77         -1.83    1.60     3.43     1.21   0.21      147
               Y        88         -1.72    1.04     2.76     0.80   0.15      225
               Z        90         -3.42    2.24     5.66     1.62   0.18      231

Science        ASAT-G   90         -2.04    1.41     3.45     1.03   0.13      303
sub-test       ASAT-H   97         -1.88    1.60     3.48     0.74   0.13      288
               V        90         -2.69    2.52     5.21     1.14   0.15      232
               W        100        -1.41    0.98     2.39     0.64   0.15      239
               Y        89         -1.80    1.69     3.49     0.88   0.15      238
               Z        100        -0.74    0.81     1.55     0.54   0.15      232

Quantitative   ASAT-G   63         -1.95    2.35     4.30     0.97   0.13      304
sub-test       ASAT-H   82         -2.16    1.60     3.76     0.70   0.13      343
               V        68         -1.77    1.71     3.48     1.01   0.17      225
               W        89         -2.29    1.92     4.21     0.95   0.15      236
               Y        97         -1.58    1.65     3.23     0.76   0.15      236
               Z        76         -3.30    2.18     [?]      [?]    0.15      225

Verbal         ASAT-G   74         -2.12    1.38     3.50     0.91   0.13      280
sub-test       ASAT-H   85         -1.43    1.28     2.71     0.66   0.12      345
               V        95         -2.73    1.65     4.38     0.89   0.14      235
               W        94         -1.53    1.62     3.15     0.80   0.14      247
               Y        95         -1.50    1.48     2.98     0.88   0.15      237
               Z        86         -1.31    1.83     [?]      [?]    0.14      238

a Cut-off for probability of sub-test fit was 0.01.
b Measured in logits.

Table 5 shows the range of examinee abilities for ASAT-G and
ASAT-H at the test/sub-test calibration levels. Without exception, the
range of estimated abilities was greater than the range of estimated item
difficulties. The match between estimated abilities and estimated item
difficulties, measured in terms of their overlap, was not entirely
satisfactory for efficient measurement practice. As can be seen in the
examples of ASAT-G and ASAT-H (Tables 4 and 5), the whole tests and
sub-tests, with the exception of the Mathematics sub-tests, were
somewhat too easy for the calibration sample. In the case of the
Mathematics sub-tests, they were too difficult for some students, matched
to the abilities of some, and far too easy for the rest of the calibration
samples. Similar results were obtained with the trial forms.

Equating Analyses

Unfortunately, as it turned out, insufficient numbers of link items were
fitted in some forms, and this undoubtedly affected the validity of the
subsequent equating analyses.

The results of the equating analyses, for the links illustrated in Figure
1, are reported in Table 6. The forms have been statistically linked at the
whole test level, and the Mathematics, Science, Quantitative and Verbal
sub-test levels. Further details of the results of equating are given in
Tables A.1 to A.4 in the Appendix to this paper.

In Table 6 are reported estimates of the translation constant and its
estimated standard error SE($t_{ba}$), for the situation where Form a is
linked to the calibration scale determined by Form b. The standard
deviation of the difference in the estimated item difficulties of the link
items in the two independent calibrations, SD($d_b - d_a$), provides a
measure of the coherence of the two sets of estimated item difficulties.
The smaller the SD($d_b - d_a$), the less 'noise' there is in the link. Links
with a lot of noise might result from calibrations which have defined two
different latent variables, perhaps to the extent that one or both
calibrations were based in part on extraneous variables.




Table 5  Estimates of Range of Examinees' Abilities(a) for Fitted Items of
         Tests/Sub-tests of ASAT-G and ASAT-H

                             Minimum            Maximum                   Estimated
                                     Raw                Raw      Min      range of
Form     Test/Sub-test       b       score      b       score    SE(b)    examinees'
                                                                          abilities

ASAT-G   Whole test          -1.11   14         4.10    [?]      0.28     5.21
         Mathematics
           sub-test          -2.49   2          3.35    [?]      0.51     5.84
         Science
           sub-test          -1.93   3          3.20    17       0.52     5.13
         Quantitative
           sub-test          -1.66   7          4.05    35       0.36     5.71
         Verbal
           sub-test          -1.91   5          3.71    30       0.39     5.62

ASAT-H   Whole test          -1.11   20         2.64    69       0.24     3.75
         Mathematics
           sub-test          -2.17   2          3.02    15       0.54     5.19
         Science
           sub-test          -1.73   5          2.35    26       0.39     4.08
         Quantitative
           sub-test          -1.55   8          3.18    35       0.33     4.73
         Verbal
           sub-test          -1.06   11         2.09    18       0.33     3.15
a Measured in logits.

Table 6  Results of the Equating Analyses

                           Fraction of                                      Smaller
Test/          Forms       link items                                       calibration
Sub-test       linked      used          SD(d_b - d_a)   t_ba     SE(t_ba)  sample
               b-a                                                          size

Whole test     G-Y(a)      12/20(b)      0.18             0.25    0.06      241
               G-Z         10/20         0.19             0.30    0.07      238
               Y-V         8/10          0.25             1.00    0.08      240
               Y-H         3/8           0.30            -0.01    0.13      241
               Z-W         7/10          0.20             0.19    0.08      [?]
               [?]         [?]           [?]             -0.15    0.13      238
               V-H         [?]           [?]              0.03    0.05      240
               W-H         [?]           [?]             -0.29    0.05      247

Mathematics    G-Y         4/12          0.31            -0.37    0.12      225
sub-test       [?]         3/5           0.31             0.27    0.13      231
               Y-H         2/3           0.04            -0.23    0.16      [?]
               Z-W         [?]           [?]              [?]     0.17      147
               Z-H         [?]           [?]             -0.82    0.10      231
               V-H         [?]           [?]             -1.07    0.13      192

Science        G-Y         4/6           0.18             0.48    0.11      238
sub-test       Y-H         4/5           0.26            -0.30    0.11      238
               Z-W         4/6           0.21            -0.02    0.11      232
               Z-H         4/4           0.53             0.02    0.11      [?]
               V-H         [?]           [?]              [?]     0.09      232
               W-H         [?]           [?]              [?]     0.09      239

Quantitative   G-Y         9/12          0.24             0.26    0.08      236
sub-test       G-Z         3/6           0.19             0.53    0.13      225
               Y-H         3/7           0.15            -0.05    0.13      236
               Z-W         7/10          0.19             0.20    0.09      225
               [?]         5/10          0.44            -0.47    0.10      225
               [?]         6/12          0.24            -0.59    0.10      225
               W-H         7/10          0.37            -0.33    0.09      236

Verbal         G-Y         4/7           0.09             0.35    0.11      237
sub-test       G-Z         7/12          0.22             0.21    0.09      238
               Y-V         8/10          0.25             0.87    0.08      235
               V-H         13/19         0.25             0.26    0.06      235
               [?]         14/18         0.25             [?]     0.06      247

a Form Y linked to the scale determined by Form G.
b Of the 20 link items originally calibrated, the 12 best fitted items were
  used in the link analysis.

Only the better fitted link items were selected for calculating the
translation constants. Some of the links proved to be very tenuous,
consisting of only two or three items, but the estimated translation
constants for these links were in general agreement with other translation
constants, as was shown in the assessment of link coherence.

Since the standard error of a translation constant is approximately
inversely proportional to the square root of the product of the number of
items in the link and the size of the smaller calibration sample, a link with
a few items could still give a respectable standard error providing the
sample was large enough. But it seems too much to hope that links
consisting of a few items can be stable in most practical situations,
especially if the links contain much noise.
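The trade-off between link length and sample size can be made concrete with the approximation SE = 3.5/sqrt(NK) used in this paper. The sketch below is illustrative; the function name and the three (K, N) combinations are hypothetical choices, not values from the study.

```python
import math

def se_translation(n, k):
    """Approximate standard error of a translation constant, 3.5 / sqrt(N*K)."""
    return 3.5 / math.sqrt(n * k)

# (number of link items K, smaller calibration sample size N)
for k, n in [(3, 240), (10, 240), (3, 2400)]:
    print("K =", k, " N =", n, " SE =", round(se_translation(n, k), 3))
```

Under these assumptions a three-item link calibrated on ten times as many examinees has a smaller nominal standard error than a ten-item link on the original sample, although, as noted above, this says nothing about the stability of so short a link when it is noisy.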

An interesting result of this investigation is that the translation
constants at the test/sub-test levels are mostly small, and in many of the
equatings closer than about three standard errors from zero.

Examples of Equating Analyses

To illustrate the process of equating ASAT forms, results are presented
for the equating at the whole test level of Form W to the scale determined
by Form Z, and for the equating at the whole test level of Form V to the
scale determined by Form Y.

Separate calibrations of items, using test data from the calibration
samples, were carried out for the four forms, producing item difficulty
estimates and estimates of their standard error. The numbers of students
in the calibrations were: 247 for Form W, 238 for Form Z, 241 for Form
Y, and 240 for Form V. Sixty-two of the items of Form W and 59 of the
items of Form Z were successfully calibrated using a probability of
sub-test fit cut-off equal to 0.01, while for Forms Y and V the numbers of
items successfully calibrated were 60 and 63, respectively.

Equating Form W to Form Z

Items 5Z and 72Z of Form Z and items 72W and 5W of Form W were
omitted from the equating process because they failed to fit the Rasch
model in one of the separate calibrations. Item 68Z of Form Z (item 1W
of Form W) was omitted because it fell outside the 95 per cent confidence
region (see Table A.1).

Consequently, of the original ten items in the link, seven were used to
calculate the equating constant. Table A.1 sets out the stages in the
calculation of the equating constant ($t = 0.19$), and the test of the
validity of the link. The link is statistically valid since the standardized
residuals in Table A.1 are distributed with approximately zero mean and
approximately unit standard deviation. The obtained mean (-0.05) and
standard deviation (0.82) do not differ appreciably from the expected
values.

The two estimated difficulties of the link items were transformed to a
common scale determined by Form Z, first by adding the translation
constant to the estimate $d_W$ of Form W, and then finding the average of
this new value and the estimate $d_Z$. These averages (indicated by a
superscript) are shown in Table A.2 in the columns giving the estimates
on the common scale. In transforming the rest of the items in the two
forms to a common scale, difficulty estimates of items in Form Z not in
the link remained unchanged, while difficulty estimates of items in Form
W were increased in value by 0.19 (the translation constant).
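The transformation just described can be sketched as follows. The function name and data layout are hypothetical; the difficulty values are a small subset read from Tables A.1 and A.2 (items 1Z/68W and 2Z/69W as link pairs, 6Z and 10W as non-link items), with the translation constant of 0.19.

```python
def to_common_scale(d_ref, d_other, t, link_pairs):
    """Place two forms on the scale of the reference form.

    d_ref, d_other: dicts of item difficulty estimates (logits).
    t: translation constant added to every estimate of the other form.
    link_pairs: (reference item, other item) pairs naming the same link item.
    """
    common_ref = dict(d_ref)                                  # unchanged unless in the link
    common_other = {item: d + t for item, d in d_other.items()}
    for ref_item, other_item in link_pairs:
        # A link item receives the average of its two estimates on the common scale.
        average = (d_ref[ref_item] + common_other[other_item]) / 2.0
        common_ref[ref_item] = average
        common_other[other_item] = average
    return common_ref, common_other

d_z = {"1Z": 0.87, "2Z": 1.98, "6Z": -1.54}
d_w = {"68W": 0.91, "69W": 1.62, "10W": 1.06}
common_z, common_w = to_common_scale(d_z, d_w, 0.19, [("1Z", "68W"), ("2Z", "69W")])
print(common_z)
print(common_w)
```

With these inputs the link items come out at about 0.985 and 1.895 logits and item 10W at about 1.25, while 6Z keeps its original estimate.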

Equating Form V to Form Y

Item 68Y of Form Y and item 6V of Form V were omitted from the
equating process because they failed to fit the Rasch model in both
calibrations. Item 5V of Form V (item 67Y of Form Y) was omitted
because it fell outside the 95 per cent confidence region (see Table A.3).
Only eight items were used to calculate the equating constant and to test
the validity of the link. Tables A.3 and A.4 show the results of the
linking exercise for Forms V and Y.

Consistency of the Links 

Figure 1 shows a number of closed loops joining three or more of the
forms at the whole test and Mathematics, Science and Quantitative
sub-test levels. If the sum of estimated translation constants for a certain
loop should estimate zero within one or two standard errors of the sum,
the links in the loop are said to be statistically consistent. In essence this
kind of information supplies additional support for the validity of the
links making up the loop.

Table 7 shows this sum and its associated standard error for eleven
loops. In all cases, except loops containing the link Y-V at the whole
test level and Loop 11, this sum is within one standard error of zero. In
Loop 11 the sum is approximately one standard error away from zero. It
appears that the links in these loops, at the test/sub-test calibration
levels, are consistent at least in terms of the criteria proposed by Wright
and Stone (1979).
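The consistency criterion amounts to comparing the absolute loop sum with one and two standard errors. A minimal sketch, with a hypothetical function name and with the (sum, standard error) pairs for Loops 5, 6 and 11 taken as read from Table 7:

```python
def loop_consistency(total, se_total):
    """Classify a loop by how many standard errors its sum lies from zero."""
    ratio = abs(total) / se_total
    if ratio <= 1.0:
        return "within one standard error"
    if ratio <= 2.0:
        return "within two standard errors"
    return "inconsistent"

# loop number -> (sum of translation constants, standard error of the sum)
loops = {5: (1.02, 0.26), 6: (-0.05, 0.26), 11: (-0.34, 0.28)}
for number, (total, se_total) in sorted(loops.items()):
    print("Loop", number, loop_consistency(total, se_total))
```

The two cut-offs mirror the 'one or two standard errors' wording of the criterion; they are a convenience for illustration rather than a formal significance test.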

The loops containing the link Y-V appear to be inconsistent. Since
the link items in Y-V, which comprise a unit of 10 humanities items, are
the first unit in Form V and the last unit in Form Y, it is possible that the
position of the unit in the two forms affected the estimation of the
translation constant. Indeed the average facility of the link items was 40
per cent in Form Y and 57 per cent in Form V; and for the eight best
fitted items it was 43 per cent and 63 per cent, respectively. The
difference in individual item facilities was almost constant between the
two forms, ranging between 15 and 20 per cent. Moreover both
calibration samples were comparable in terms of their composition. The
relatively large translation constant reflects the fact that the link items in
Form Y were estimated to be more difficult than in Form V. These
results suggest that, in this case, the item calibrations and form equatings
based on the trial test data may be somewhat unreliable.

Table 7  Evaluation of the Consistency of the Links in the Loops
         Displayed in Figure 1

                                                  Sum of the    Standard error
Test/          Loop                               translation   of the sum of the
Sub-test       number   Loop                      constants     translation constants

Whole test     1        [?]                       -1.08         0.39   (inconsistent)(a)
               2        [?]                       -1.13         0.39   (inconsistent)
               3        [?]                       -0.04         0.39
               4        G-Y-H-Z-G                 -0.09         0.39
               5        Y-V-H-Y                    1.02         0.26   (inconsistent)
               6        [?]                       -0.05         0.26

Mathematics
sub-test       7        G-Y-H-Z-G                 -0.02         0.51

Science
sub-test       8        Z-W-H-Z                    0.22         0.31

Quantitative
sub-test       9        G-Y-H-W-Z-G                0.19         0.52
               10       G-Y-H-Z-G                 -0.15         0.44
               11       Z-W-H-Z                   -0.34         0.28

a The links in the loop are consistent if the sum of the translation
  constants estimates zero within one or two standard errors.



The results obtained in this investigation indicate that it is feasible to
calibrate the ASAT and to equate its forms using the Rasch Simple
Logistic Model. However, if the ASAT test is to be prepared on the basis
of Rasch measurement principles, the existing program of test
development, as exemplified in the test's list of specifications, will need to
be modified to allow the preparation of Rasch-like tests. From the
percentages of fitted items at the whole test and sub-test levels, it is clear
that larger pools of items in each of the subject areas would be required
than are now available, if test lengths of 100 items are to be achieved.

A crucial aspect of Rasch measurement is the assessment of item fit to
a 'unidimensional' latent variable. The Rasch model assumes that only
one latent variable exists, but it might be reasonably argued that with
factorially complex tests like the ASAT, more than one latent variable is
really needed to explain the complex pattern of test responses. Perhaps
the way round this problem is to break up the test into homogeneous
parcels or sub-tests each of which can be safely characterized by a single
latent variable. But this action might not guarantee unidimensional
sub-tests, because a single item may measure many kinds of abilities and
skills. This is a real dilemma for the test developer who has to construct
and arrange test items into meaningful and useful parcels.

The results seem to indicate that the Rasch Simple Logistic Model will
attempt to fit to a common latent variable any group of items that cohere
in some fashion. It will attempt to do this on statistical grounds and will
pick up as the latent variable a kind of lowest common denominator.
This observation is in agreement with the findings of Reckase (1979).

Finally the limited results of the equating analyses suggest that the
ASAT may be equated at the whole test level and various sub-test levels.
Unfortunately the stability of some of the links is questionable because
they consist of very few items. Nevertheless a general picture has
emerged which should lend some support to those interested in applying
Rasch measurement principles to factorially complex scholastic aptitude
tests like the ASAT.

APPENDIX



Table A.1  Calculation of the Translation Constant (t) and Test of the
           Validity of the Link Z-W: Whole Test Equating
           (Excluding Items 5Z, 68Z, 72Z, 72W, 1W, 5W)

       Form Z                   Form W                        Item link
                                                   D =
Item   d_Z     SE(d_Z)   Item   d_W     SE(d_W)   d_Z - d_W   D - t    CHI(a)   S      z

1Z     0.87    0.16      68W    0.91    0.16      -0.04       -0.23    1.23     0.23   -1.00
2Z     1.98    0.22      69W    1.62    0.19       0.36        0.17    0.67     0.29    0.59
3Z     2.18    0.24      70W    1.86    0.21       0.32        0.13    0.39     0.32    0.41
4Z     2.18    0.24      71W    2.00    0.22       0.18       -0.01    0.0      0.33   -0.03
5Z     1.04    0.16      72W    b
68Z   -0.40    0.14      1W    -1.26    0.14       0.86
69Z   -0.13    0.14      2W    -0.22    0.13       0.09       -0.10    [?]      0.19   -0.53
70Z    1.14    0.17      3W     0.68    0.15       0.46        0.27    1.70     0.23    1.17
71Z    1.09    0.16      4W     1.12    0.17      -0.03       -0.22    1.13     0.23   -0.96
72Z    b                 5W     0.55    0.15

Mean   1.33                     1.14               0.19        0.00                    -0.05
SD     0.85                     0.77               0.20        0.20                     0.82

Translation constant $t = \sum D / 7 = 0.19$
a CHI is distributed approximately as $\chi^2$ with 1 degree of freedom
  (Wright and Stone, 1979, p. 96).
b Item not fitted when calibrated.

Table A.2  Item Difficulty Estimates for Form W and Form Z from
           Initial Calibrations and upon Translation to a Common
           Scale Determined by Form Z: Whole Test Equating

           (Estimates for first 20 items of each form are shown.)

              Form W                                 Form Z
                            Common                                 Common
Item   d_W     SE(d_W)      scale      Item   d_Z     SE(d_Z)      scale

1W    -1.26    0.14         -0.74      1Z     0.87    0.16          0.99(a)
2W    -0.22    0.13         -0.08(a)   2Z     1.98    0.22          1.90(a)
3W     0.68    0.15          1.01(a)   3Z     2.18    0.24          2.12(a)
4W     1.12    0.17          1.20(a)   4Z     2.18    0.24          2.19(a)
5W     0.55    0.15          omit      5Z     1.04    0.16          omit
6W    -2.07    0.17         -1.88      6Z    -1.54    0.16         -1.54
7W    -0.76    0.13         -0.57      7Z    -0.84    0.14         -0.84
8W    -1.20    0.14         -1.01      8Z    -0.69    0.14         -0.69
9W    -0.67    0.13         -0.48      9Z    -1.25    0.15         -1.25
10W    1.06    0.16          1.25      10Z    0.84    0.15          0.84
11W   -1.20    0.14         -1.01      11Z   -0.29    0.15         -0.29
12W    b                               12Z   -0.60    0.14         -0.60
13W    0.34    0.14          0.53      13Z   -0.48    0.14         -0.48
14W    b                               14Z    0.39    0.14          0.39
15W   -0.17    0.13          0.02      15Z   -0.60    0.14         -0.60
16W    0.37    0.14          0.56      16Z   -0.42    0.14         -0.42
17W    0.20    0.14          0.39      17Z   -0.73    0.14         -0.73
18W    0.18    0.14          0.37      18Z    b
19W   -0.15    0.14          0.04      19Z    0.41    0.14          0.41
20W    1.14    0.17          1.33      20Z    b
d_W, d_Z            item difficulty estimates in logits
SE(d_W), SE(d_Z)    standard errors of estimates in logits
Common scale        average difficulty estimates on common scale in logits
omit                item omitted from common scale because it was not fitted
                    in both calibrations
a                   used to calculate the translation constant
b                   item not fitted when calibrated

Table A.3  Calculation of the Translation Constant and Test of the
           Validity of the Link Y-V: Whole Test Equating (Based upon
           the First Four and Last Four Items in the Table)

       Form Y                   Form V                        Item link
                                                   D =
Item   d_Y     SE(d_Y)   Item   d_V     SE(d_V)   d_Y - d_V   D - t    CHI(a)   S      z

63Y    0.03    0.13      1V    -1.21    0.15       1.24        0.24    1.34     0.20    1.20
64Y    0.77    0.14      2V     0.00    0.14       0.77       -0.23    1.21     0.20   -1.15
65Y   -0.02    0.13      3V    -0.81    0.14       0.79       -0.21    1.01     0.19   -1.11
66Y   -0.79    0.14      4V    -2.17    0.20       1.38        0.38    3.30     0.24    1.58
67Y    1.07    0.13      5V     0.98    0.15       0.09
68Y    1.05    0.15      6V     b
69Y    0.03    0.13      7V    -1.08    0.15       1.11        0.11    0.28     0.20    0.55
70Y    1.03    0.15      8V     0.17    0.14       0.86       -0.14    0.45     0.21   -0.67
71Y    0.21    0.14      9V    -0.93    0.14       1.14        0.14    0.45     0.20    0.70
72Y    0.92    0.15      10V    0.24    0.14       0.68       -0.32    2.34     0.21   -1.52

Mean   0.42                    -0.52               1.00        0.00                    -0.05
SD     0.46                     0.63               0.25        0.25                     1.12

Translation constant $t = \sum D / 8 = 1.00$

a CHI is distributed approximately as $\chi^2$ with 1 degree of freedom
  (Wright and Stone, 1979, p. 96).
b Item not fitted when calibrated.

Table A.4  Item Difficulty Estimates for Form V and Form Y from
           Initial Calibrations and upon Translation to a Common
           Scale Determined by Form Y: Whole Test Equating

           (Estimates for first ten and last ten items of each form are
           shown.)






I'orin V 








Wmw Y 




1 (L in 


H 








(i^ 




d\ 


1 > 


1 . J 


n 1 s 


- U.UV 


1 V 


-0.71 


0,14 


-0.71 


■>\.' 






0.88" 


2> 


0,03 


0.13 


0.03 


> 


i\ w 1 

U . O 1 


i\ 1 1 


(J. ox 


3Y 


0.87 


0,14 


-0.87 


AV 


"> 1 7 




0.98" 


4Y 


1,64 


0,17 


- 1.64 


* S V 


U, VfS 


0. 1 J 


1 . 96 


5>' 


0.98 


0. 1 5 


0.98 


6V 








O T 


1 .29 


0. 16 


1.29 


7V 


1.08 


0.15 


■ 0.03" 


7V 


0.54 


0. 14 


- 0, S4 

u 


K\ 


0.17 


0,14 


1,10" 


8Y 






9\ 


0,93 


14 


0.14- 


9Y 


0.23 


0.14 


-0.23 


lOV 


0,24 


0.14 


1.08" 


lOY 


1,37 


0.16 


- 1,37 


62V 


0.39 


0,14 


1.39 








(^^\ 


0.09 


0,14 


1.09 


63 Y 


0,03 


0,13 


(),I9" 


64 V , 


0.12 


0,14 


0.88 


64 Y 


0.77 


0.14 


0.88- 


65 V 


0.37 


0,14 


0.63 


65 Y 


0.02 . 


0.13 


0.08" 


66 V 


0.5H 


()J4 


0.42 


66Y 


0.79' 


0.14 


0.98" 


6-?V 


0,37 


0.14 


1.33 


67Y 


1.0^ 


0.15 


1.07 


6SV 


0.41 


0,14 


0.59 


68 Y 


1.05 


0.15 


omit 


69V 


0,76 


0.15 


1.76 


69Y 


0.03 


0,13 


- 0.03" 


7()V 


0,07 


0.14 


0.93 


70Y 


1.03 


0.15 


1.10" 


"IV 


0.33 


0,14 


1.33 


71 Y 


0.21 


0,14 


0.14" 










72Y 


0.92 


0.15 


1.08" 


d_V, d_Y            item difficulty estimates in logits
SE(d_V), SE(d_Y)    standard errors of estimates in logits
Common scale        average difficulty estimates on common scale in logits
omit                item omitted from common scale because it was not fitted
                    in both calibrations
a                   used to calculate the translation constant
b                   item not fitted when calibrated



T 
\ 



The Improvement of Measurement in Education and Psychology
Edited by Donald Spearritt
Copyright © ACER 1982
Some Alternative Approaches to the
Improvement of Measurement in
Education and Psychology:
Fitting Latent Trait Models



After some 70 years of research, the virtues and limitations of the linear
common factor model, as a tool for the structural analysis of a battery of
mental tests, are now reasonably well understood. In the well-known case
that Spearman originally treated, we explain the covariation of a set of
tests by supposing that they have linear regressions on a single variable
(a 'common factor' or 'latent trait') with residuals that are uncorrelated.
The Spearman case is free from those problems of rotational and
interpretational indeterminacy that have made some social scientists
suspicious of factor analysis, and the model gives us a reasonable
definition of unidimensionality or homogeneity for a set of quantitatively
scored tests. That is, if the tests fit the single-factor model
'satisfactorily', we say that the battery is 'unidimensional' or
'homogeneous' in the clear sense of these terms that the common factor
model provides. Given estimation of the parameters of the model by the
method of maximum likelihood, we can obtain a statistical test for the
unidimensionality hypothesis. At the same time, the residual covariance
matrix supplies a nonstatistical but very reasonable basis for judging the
extent of the misfit of the model to the data. In practice, the residuals
are, we might argue, more important than the test of significance, since
the unidimensionality hypothesis, like all restrictive hypotheses, must be
false, and will be proved so by the chi-square test on a sufficiently large
sample. If the residuals are small, the fit of the hypothesis can still be
judged to be satisfactory.

It is possible to show, but the demonstration would take us too far
afield, that there is no psychometric distinction to be made between a
Spearman common factor and a generic true-score as treated in Lord and
Novick (1968). A Spearman factor analysis can therefore be made the basis
for a considerable amount of test-theoretic analysis, including the
assessment of generic reliability; that is, of generalizability in the
sense of Cronbach et al. (1972). (See McDonald, 1978a.)

If by construction or by chance, the factor loadings of a set of
factorially homogeneous quantitative tests are equal, the tests are
essentially tau-equivalent in the sense of Lord and Novick (1968), and the
common factor differs only trivially from a specific true score. The factor
analysis then supplies a basis for the assessment of reliability in the
classical sense of measurement error (whatever that really means).

In its origins, latent trait theory (latent structure analysis) was
motivated by the recognition that linear common factor analysis could not
be carried over from the quantitative test to the qualitative test-item
(Lazarsfeld, 1950; Guttman, 1950). The central reason for this is that the
regression curve of a binary item on any independent variable (observed or
unobserved) represents the conditional probability of passing the item.
(Here we use the word 'passing' for whatever is scored as the positive
response, without intending any loss of generality.) Since the regression
curve is a curve of conditional probabilities, it must therefore be bounded
by zero and unity, and cannot be linear. To the extent that item
characteristic curves (the regressions of the items in a test upon a latent
trait) can be approximated by straight lines over the interval containing
most of the examinees, we can justify the simple process of fitting the
latent linear model, which is just the Spearman common factor model
(Lazarsfeld, 1950; Torgerson, 1958; McDonald, 1967a), and we can tolerate
the continuing practice of factor analysing product-moment correlations of
binary items, the so-called phi coefficients. Although there is some
evidence that this approximation is not nearly as bad in practice as we
might expect from theory, concern about difficulty factors (see McDonald,
1965, 1967a; McDonald and Ahlawat, 1974), as well as the admitted
theoretical inappropriateness of the linear model, has led to the
introduction of appropriate models for binary data that are essentially
counterparts of the Spearman case for quantitative variables.

In the Spearman case, given variables $y_j$, $j = 1, \ldots, n$, we assume
that there exists a single common factor or latent trait $x$, such that the
regression curve is given by

$$E\{y_j \mid x = x_i\} = m_j + f_j x_i, \tag{1}$$

and such that for any fixed value $x_i$ of $x$, the variables are
uncorrelated. It is reasonable to suppose that all users of this model, if
questioned, would say that they intend the stronger assumption that for
fixed $x$ the variables are distributed independently. That is, users
probably intend to assume the principle of local independence (Lazarsfeld,
1950; Anderson, 1959; McDonald, 1962). In practice it is both convenient
and sufficient in general to test the weak implication that, when the
common factor is partialled out, the residual covariances are zero; that
is, to test the familiar implication that

$$\mathrm{Cov}\{\mathbf{y}\} = \mathbf{f}\mathbf{f}' + \mathbf{U}^2, \tag{2}$$

where $\mathbf{f}' = [f_1, \ldots, f_n]$ and $\mathbf{U}^2$ is a diagonal
matrix of residual variances.

This is tantamount to ignoring possible information in the higher
moments of the distribution of $(y_1, \ldots, y_n)$.
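
The covariance structure (2) and the residuals used to judge its fit can be sketched numerically. The following is a minimal illustration only: the loadings, the unit test variances, and the small disturbance added to one cell are made-up values, not taken from any data set discussed here.

```python
# Single-factor (Spearman) covariance structure: C = f f' + U^2.
# All numbers below are hypothetical illustrative values.

def implied_covariance(f, u2):
    """Return the matrix f f' + diag(u2) as a list of rows."""
    n = len(f)
    return [[f[j] * f[k] + (u2[j] if j == k else 0.0) for k in range(n)]
            for j in range(n)]

def residual_matrix(sample_cov, f, u2):
    """Residuals: sample covariances minus those implied by the model."""
    implied = implied_covariance(f, u2)
    n = len(f)
    return [[sample_cov[j][k] - implied[j][k] for k in range(n)]
            for j in range(n)]

loadings = [0.8, 0.7, 0.6, 0.5]                  # hypothetical f_j
uniquenesses = [1.0 - x * x for x in loadings]   # u_j^2 for unit-variance tests

C = implied_covariance(loadings, uniquenesses)

# A "sample" matrix that departs from the model in a single pair of tests.
S = [row[:] for row in C]
S[0][1] += 0.03
S[1][0] += 0.03

R = residual_matrix(S, loadings, uniquenesses)
largest = max(abs(R[j][k]) for j in range(4) for k in range(4) if j != k)
print(round(C[0][1], 3), round(largest, 3))
```

Inspecting the largest off-diagonal residual in this way corresponds to the nonstatistical judgement of fit described earlier; it does not replace the likelihood-ratio test.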

Lazarsfeld (1950), seeking a suitable counterpart of factor analysis for
binary data, first explored the class of polynomial item characteristic
curves (which in principle suffer the same difficulties as the linear item
characteristic curve). To bypass technical difficulties, he then
substituted the latent class model for the polynomial model. With this step
the central idea of a distribution of the latent trait or traits over a
continuum of any dimensionality is given up altogether. (See McDonald,
1967a.) Lawley (1943) and Finney (1952) independently introduced curves
that are actually appropriate for the regression of a binary item upon an
observed independent variable, such as, in Lawley's case, the total test
score. Lord (1968) attributes the basis of modern item characteristic curve
theory to Lawley (1943) but, on one interpretation, it seems to be Lord
(1952) himself who first combined the probit curve (the normal ogive) with
the principle of local independence to yield the normal ogive latent trait
model. Birnbaum (1957-1958; see Lord and Novick, 1968) gave theory for the
equivalent logistic model.

We can reasonably consider the normal ogive and logistic models as
nonlinear counterparts of the Spearman model. This is immediately seen
on writing the two-parameter versions of these models as

$$E\{y_j \mid x = x_i\} = N(m_j + f_j x_i) \tag{3}$$

and

$$E\{y_j \mid x = x_i\} = \Psi[D(m_j + f_j x_i)], \tag{4}$$

where

$$N(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-z^2/2}\, dz, \tag{5}$$

the normal distribution function,

$$\Psi(t) = 1/(1 + e^{-t}), \tag{6}$$

the logistic function, and $D$ is a known constant, remembering that if
$y_j$ is a binary variable, coded unity for 'pass' and zero for 'fail', then

$$E\{y_j \mid x = x_i\} = P\{y_j = 1 \mid x = x_i\}. \tag{7}$$



Equations (3) and (4) are nonlinear counterparts and indeed nonlinear
transformations of the Spearman model (1), with the transformations
chosen so as to satisfy the bounds we require on the regression curves
when they represent probabilities. As a consequence of these
transformations, the combination of the assumed item characteristic curves
with the principle of local independence no longer yields a simple
covariance structure such as (2) for the relations between the variables.
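
A small numerical sketch may help fix ideas about the curves (3) and (4). It assumes the conventional scaling constant D = 1.7, which the text above describes only as "a known constant", and the item parameters used are hypothetical. It illustrates that the two bounded curves nearly coincide, while the linear regression (1) leaves the unit interval.

```python
import math

def normal_ogive(t):
    """N(t): the standard normal distribution function of (5)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def logistic(t):
    """Psi(t) = 1 / (1 + exp(-t)), the logistic function of (6)."""
    return 1.0 / (1.0 + math.exp(-t))

def icc_normal(x, m, f):
    """Two-parameter normal ogive item characteristic curve, as in (3)."""
    return normal_ogive(m + f * x)

def icc_logistic(x, m, f, D=1.7):
    """Two-parameter logistic curve, as in (4); D = 1.7 is an assumption."""
    return logistic(D * (m + f * x))

def icc_linear(x, m, f):
    """Linear (Spearman) regression, as in (1); not bounded by 0 and 1."""
    return m + f * x

grid = [i / 100.0 for i in range(-400, 401)]      # latent trait values
max_gap = max(abs(icc_normal(x, 0.0, 1.0) - icc_logistic(x, 0.0, 1.0))
              for x in grid)
print("largest ogive/logistic gap on the grid:", max_gap)

# Hypothetical linear item: the "probability" exceeds unity at x = 3.
print("linear curve at x = 3:", icc_linear(3.0, 0.5, 0.2))
```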

Not surprisingly, fitting a nonlinear latent trait model proves more
difficult than fitting the counterpart linear common factor model. In the
common factor model (Lawley (1942) and McDonald (1979) excepted) we treat
the common factors as random independent variables, that is, random
regressors. The test covariances yield all the information needed for
estimating the parameters. If we are interested in factor scores (values of
the latent traits of individual examinees), we estimate these for any
examinee, whether from the original sample used to estimate the parameters
of the model or not, quite independently of the estimation of the model
parameters. In contrast, most proposals for fitting the normal ogive or
logistic models treat the $N$ latent trait values $x_i$, $i = 1, \ldots, N$,
of the examinees in a sample, as parameters to be estimated simultaneously
with the item parameters, usually $m_j$, $f_j$ in (3) and (4). That is, we
treat the latent trait as a fixed independent variable, a fixed regressor.
The main exception (Bock and Lieberman, 1970) uses extremely costly
numerical procedures and is not recommended by the authors for practical
applications, admirable though it may be as a theoretical tour de force.

Before going on to an examination of the problem of fitting a latent
trait model by conventional methods, we should note the special case of
the logistic model (4) in which we write

$$E\{y_j \mid x = x_i\} = \Psi[D(m_j + f x_i)], \tag{8}$$

that is, we set every $f_j$ equal to a common value $f$. This
transformation of the case, that we have noted earlier to be that of
essentially tau-equivalent tests, was proposed by Rasch and has been
popularized recently by Wright and others. For certain purposes we will
regard the normal ogive model (3) with equal $f_j$ values as a version of
the Rasch model also. It is of course indistinguishable from it. For many
applications for which these models seem to have been intended, we must
substitute

$$E\{y_j \mid x = x_i\} = g_j + (1 - g_j)\, N(m_j + f_j x_i) \tag{9}$$

and

$$E\{y_j \mid x = x_i\} = g_j + (1 - g_j)\, \Psi[D(m_j + f_j x_i)], \tag{10}$$

so that the models may be employed on multiple-choice items intended to
measure abilities. The guessing parameters $g_j$ could in principle be
estimated along with the $m_j$ and the $f_j$ (if we are not using the
simple Rasch model) together with the $x_i$ values or, as in Lord (1968),
they could be estimated independently, perhaps as the chance level, that
is, the reciprocal of the number of options.
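
As a brief sketch of model (9), the fragment below evaluates a normal ogive item characteristic curve with a guessing parameter fixed at the chance level. The four-option item and its parameter values are hypothetical; the point is only that the lower asymptote of the curve becomes $g_j$ instead of zero.

```python
import math

def normal_ogive(t):
    """Standard normal distribution function N(t)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def icc_with_guessing(x, m, f, g):
    """Model (9): g + (1 - g) N(m + f x)."""
    return g + (1.0 - g) * normal_ogive(m + f * x)

options = 4                      # hypothetical multiple-choice item
g = 1.0 / options                # chance level used as the guessing parameter

for x in (-10.0, 0.0, 10.0):
    print(x, round(icc_with_guessing(x, 0.0, 1.0, g), 4))
```

With $m_j = 0$ the curve passes through $g + (1 - g)/2$ at $x = 0$, midway between its two asymptotes.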

The literature on fitting latent trait models seems to be in a rather
unsatisfactory state. It is a simple matter to write down the likelihood
function and its first and second derivatives with respect to the
parameters of a fixed regressors model. The second derivative matrix is
very strongly patterned, allowing in principle minimization of (minus the
logarithm of) the likelihood function by blocks of iterative steps, one for
each set of parameters, that are essentially simple Newton-Raphson steps.
From what is stated, and from what is not explained, by Lord (1968),
Kolakowski and Bock (1970), and Wingersky and Lord (1973), it appears that
investigators who have attempted to program what might seem to be an
unusually simple minimization algorithm have had to deal with a large
number of problems by trial and error, to the point where the reader cannot
be sure just what has been programmed. I hope to be corrected, but there
does not seem to be any published demonstration by Monte Carlo study that
any of the programs for fitting the two-parameter model recovers the true
values of the parameters within reasonable tolerance. Lord (1968) states
that his method does not converge unless both the number of items and the
number of examinees are large, and that otherwise values of $f_j$ tend to
increase without limit for some items. Wright (1977) conjectures that this
must happen, and concludes that the two-parameter model therefore cannot be
fitted to data. It does indeed seem that the simultaneous estimation of the
item parameters $f_j$ and person parameters $x_i$ may strongly tend to run
into difficulties of the kind noted by Lord and commented upon by Wright.
(A similar problem in Lawley's (1942) fixed-regressors factor model is
solved by the choice of a loss function in the form of a more appropriate
function of likelihood (McDonald, 1979), but the present problem does not
seem to yield an analogous treatment.) The introduction of the guessing
parameters possibly makes the situation worse, especially if we attempt to
estimate them rather than supply them as constants. We might also question
the claims that have been made in favour of the Rasch model as free from
difficulties in the methods used to fit its parameters, at least if the
model is applied to multiple-choice items, since in the usual estimation
procedures there is no provision for estimating the guessing parameters,
and there is no reason to believe that the estimates of the other
parameters of the model are unaffected by guessing.

Actually, the case for using maximum likelihood estimation in the
two-parameter model begins to look less interesting when we note that a
test of fit of the model does not seem to have been given, to go with the
likelihood estimates of its parameters, and that 'good' properties of the
maximum likelihood estimators have not actually been demonstrated for
these models. Although it has been thought otherwise, this last remark may
well apply to the Rasch model as treated by Wright and others. Measures of
fit have been suggested for the Rasch model and conjectured to have a
chi-square distribution (Wright and Panchapakesan, 1969). According to
Wright and Mead (1977), however, simulation studies have shown that this
distribution is 'not exactly correct'. Our own simulation studies suggest
that it is not even approximately correct for the measures of fit in the
OISE version of a program originally by Wright and Panchapakesan. It does
seem that current methods for fitting these latent trait models lack a
properly established statistical criterion for rejecting the model. Perhaps
more importantly, they certainly lack criteria for regarding the fit as
satisfactory, criteria analogous to the sizes of the residual covariances
after fitting the linear common factor model. It is partly for this reason
that some writers have stressed the need to test the unidimensionality of a
set of items by some means, before actually fitting a model with a single
latent trait. Hambleton et al. (1978) state that testing the assumption of
unidimensionality takes precedence over other goodness-of-fit tests of a
latent trait model, and that further research is needed to establish a
proper procedure to test the dimensionality. Crude devices have been
suggested, such as the examination of the eigenvalues of the item
covariance or correlation matrix, but such procedures are not well founded.
(See McDonald, 1981.)

If the analysis just given is correct, latent trait theory is in a
problematic state, and one that is not without some historical irony. It
was introduced because linear common factor analysis was recognized to be
inadequate to supply a dimensional analysis for binary items. It has
reached the point where, given the values of the parameters of a latent
trait model, we know how to use them for a wide variety of test-theoretic
purposes. Yet we still have to resort to a form of linear factor analysis
for a crude test of unidimensionality, we still have reason to doubt the
estimation procedures that have been proposed, and we still have no
satisfactory statistical criterion for rejecting the model and no
satisfactory criterion for regarding its fit as adequate.

McDonald (1967a, b) gave theory for nonlinear factor analysis, and
numerical methods for fitting nonlinear regressions, in the form of
polynomial functions, of quantitative tests or of binary items, on factors,
that is, on latent traits. In contrast to item characteristic curves such
as the normal ogive and logistic functions, which are nonlinear functions
both of the latent traits and the item parameters, the polynomial item
characteristic curves are nonlinear in the latent traits but linear in
their coefficients (the item parameters). The immediate consequence is that
the nonlinear factor model shares many of the simple algebraic properties
of the linear common factor model. In particular it allows us to assess the
adequacy of the fit of the model by examining residual covariances. In
principle, therefore, nonlinear factor analysis supplies a general test of
the unidimensionality or homogeneity of a set of binary items without the
strong and false assumption of linear item characteristic curves that is
implicit in the usual attempts to assess dimensionality prior to fitting a
latent trait model. However, polynomial item characteristic curves share
the defect of linear item characteristic curves that they are not bounded
as required for probabilities. It might therefore seem unlikely that we
could use nonlinear factor analysis in practice for this purpose, but
shortly we will see that we can in fact do so under some conditions.

McDonald (1967a) sought to show that latent trait models such as the
normal ogive model can be treated as special cases of nonlinear factor
analysis by expressing the normal ogive curve as an infinite series whose
terms are polynomials that are mutually orthogonal under the assumption
that the latent trait has a normal distribution. If $x$ has a normal
distribution with mean zero and variance unity, then the normalized
Hermite-Tchebycheff polynomials given by

$$h_p(x) = \frac{(-1)^p}{\sqrt{p!}}\, e^{x^2/2}\, \frac{d^p}{dx^p}\, e^{-x^2/2}, \quad p = 0, 1, 2, \ldots, \tag{11}$$

have mean zero, variance unity, and covariances zero (the constant $h_0$
apart). That is, for $p \geq 1$,

$$E\{h_p(x)\} = 0 \tag{12}$$

and, for all $p$ and $q$,

$$E\{h_p(x)h_q(x)\} = \begin{cases} 1, & p = q, \\ 0, & \text{otherwise.} \end{cases} \tag{13}$$

The first four orthogonal polynomials are given by

$$h_0 = 1$$
$$h_1(x) = x$$
$$h_2(x) = (x^2 - 1)/\sqrt{2}$$
$$h_3(x) = (x^3 - 3x)/\sqrt{6}.$$

Recalling that the first six moments of the normal distribution with mean
zero and variance unity are $\mu_1 = 0$, $\mu_2 = 1$, $\mu_3 = 0$,
$\mu_4 = 3$, $\mu_5 = 0$, $\mu_6 = 15$, we easily verify for example that

$$E\{h_1(x)h_2(x)\} = E\{(x^3 - x)/\sqrt{2}\} = (\mu_3 - \mu_1)/\sqrt{2} = 0,$$

and

$$E\{h_3^2(x)\} = E\{(x^6 - 6x^4 + 9x^2)/6\} = (\mu_6 - 6\mu_4 + 9\mu_2)/6 = 1,$$

and similarly for the remainder. (These polynomials serve as a good
classroom demonstration of the fact that, if two random variables are
uncorrelated, they are not necessarily statistically independent, and
indeed one can be a curvilinear function of the other.) Orthogonal
polynomials such as these provide the building blocks for a curvilinear
regression, in which the uncorrelated components supply additive variance.
That is, instead of fitting a polynomial regression of some $y$ on some
$x$ as

$$y = a_0 + a_1 x + a_2 x^2 + a_3 x^3 + \ldots + e, \tag{14}$$

it is usually better to fit

$$y = b_0 + b_1 h_1(x) + b_2 h_2(x) + b_3 h_3(x) + \ldots + e, \tag{15}$$

because the terms in (15) are uncorrelated and supply contributions to
the variance of $y$ whose magnitude and significance can be assessed
separately. The series can be terminated when all systematic variance has
been captured.
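
The orthonormality stated in (12) and (13) can be checked numerically. The sketch below is an illustration only: it generates the polynomials by the standard three-term recurrence and approximates the expectations by the trapezoidal rule on a finite grid. The function names and the integration settings are choices made here.

```python
import math

def h(p, x):
    """Normalized Hermite-Tchebycheff polynomial: He_p(x) / sqrt(p!)."""
    if p == 0:
        return 1.0
    he_prev, he = 1.0, x                         # He_0 and He_1
    for k in range(1, p):
        # Recurrence He_{k+1}(x) = x He_k(x) - k He_{k-1}(x)
        he_prev, he = he, x * he - k * he_prev
    return he / math.sqrt(math.factorial(p))

def normal_density(x):
    return math.exp(-0.5 * x * x) / math.sqrt(2.0 * math.pi)

def expect(func, lo=-10.0, hi=10.0, steps=4000):
    """E{func(x)} for standard normal x, by the trapezoidal rule."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * width
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * func(x) * normal_density(x)
    return total * width

# Matrix of E{h_p(x) h_q(x)} for p, q = 0, ..., 3; it should be the identity.
gram = [[expect(lambda x, p=p, q=q: h(p, x) * h(q, x)) for q in range(4)]
        for p in range(4)]
for row in gram:
    print([round(value, 6) for value in row])
```

Note that $h_2(x)$ is an exact function of $h_1(x) = x$ although the two are uncorrelated, which is the classroom point made above.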

Given orthogonal polynomials appropriate to the distribution of $x$ (not
necessarily the Hermite-Tchebycheff series (11)), the method of
polynomial factor analysis introduced by McDonald (1967a, 1967b)
amounts to recognizing that if we write nonlinear common factor models

$$y_j = a_{j0} + a_{j1} x + a_{j2} x^2 + \ldots + e_j, \tag{16}$$

or

$$y_j = b_{j0} + b_{j1} h_1(x) + b_{j2} h_2(x) + \ldots + e_j, \quad j = 1, \ldots, n, \tag{17}$$

with uncorrelated residuals, then in the second version the polynomials
behave just like orthogonal common factors. The technical problem of
nonlinear factor analysis (which need not concern us here) is to
discriminate between a model such as (17) and the parallel linear model

$$y_j = b_{j0} + b_{j1} x_1 + b_{j2} x_2 + \ldots + e_j, \tag{18}$$

both of which imply the covariance structure

$$\mathrm{Cov}\{y_j, y_k\} = b_{j1} b_{k1} + b_{j2} b_{k2} + \ldots, \quad j \neq k. \tag{19}$$

This is done by studying the distribution of factor scores in common
factor space, to see if these lie within curved subspaces. (See McDonald,
1967b.)

We can use a kind of harmonic or Fourier analysis to approximate any
prescribed curve by a polynomial series. By making the series long
enough, we can make the approximation as precise as we please over a
finite range. For the remainder of this paper, it will be convenient to
describe the normal ogive characteristic curve in the traditional way as
$N(x; \mu_j, \sigma_j)$ where $\mu_j$ and $\sigma_j$ are the mean and
standard deviation of the cumulative distribution function $N(\cdot)$, so
that in (3)

$$f_j = 1/\sigma_j \tag{20}$$

and

$$m_j = -\mu_j/\sigma_j. \tag{21}$$

Figure 1  Normal Ogive ($\mu$ = 0.5, $\sigma$ = 2.0) with Polynomial Approximations

Figure 1 shows a normal ogive with $\mu_j = 0.5$, $\sigma_j = 2.0$.
Superimposed upon it are the best-approximating linear, quadratic, and
cubic curves, obtained by stopping at the second, third, and fourth term
of the series

$$y_j = b_{j0} + b_{j1} x + b_{j2}(x^2 - 1)/\sqrt{2} + b_{j3}(x^3 - 3x)/\sqrt{6}. \tag{22}$$

The coefficients $b_{j0}$, $b_{j1}$, $b_{j2}$, $b_{j3}$ are chosen to give
a least-squares best fit of the polynomial curve to the normal ogive,
weighted by the normal density function. That is, the coefficients are
chosen to minimize

$$\phi = E\Big\{\Big[N(x; \mu_j, \sigma_j) - \sum_{p=0}^{r} b_{jp} h_p(x)\Big]^2\Big\}, \quad r = 0, 1, \ldots \tag{23}$$
McDonald (1967a) showed that if we express $N(x; \mu_j, \sigma_j)$ as the
infinite series

$$N(x; \mu_j, \sigma_j) = \sum_{p=0}^{\infty} b_{jp} h_p(x) \tag{24}$$

where

$$b_{j0} = N(-\mu_j/\alpha_j) \tag{25}$$

and

$$b_{jp} = p^{-1/2} \alpha_j^{-p}\, h_{p-1}(\mu_j/\alpha_j)\, n(\mu_j/\alpha_j), \quad p = 1, 2, \ldots, \tag{26}$$

where $\alpha_j = (1 + \sigma_j^2)^{1/2}$ and $N(\cdot)$ and $n(\cdot)$ are
the normal distribution and normal density functions, then every finite
segment of the infinite series has the property that it minimizes (23) for
the chosen number of terms. In particular, (26) yields

$$b_{j1} = \alpha_j^{-1}\, n(\mu_j/\alpha_j) \tag{27}$$

$$b_{j2} = 2^{-1/2} \alpha_j^{-2}\, h_1(\mu_j/\alpha_j)\, n(\mu_j/\alpha_j) \tag{28}$$

$$b_{j3} = 3^{-1/2} \alpha_j^{-3}\, h_2(\mu_j/\alpha_j)\, n(\mu_j/\alpha_j). \tag{29}$$

The intended application of this theory was to fit the nonlinear factor
model to a set of binary data by methods described in McDonald (1967b),
up to the second or third degree, say. If the single-factor nonlinear
model gave a reasonable account of the data we would then examine the
distribution of the factor scores to see if it is normal, and examine the
coefficients $b_{jp}$ to see if their relationships were consistent with
those required by (25) and (26). The equations (25) and (26) could then be
solved for $\mu_j$ and $\sigma_j$. Unpublished work by McDonald and Ahlawat
showed that reasonably precise estimates of the parameters of the normal
ogive model could be obtained using this technique. (See also McDonald
and Ahlawat (1974) for an account of difficulty factors in terms of this
theory.)
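
The closed-form coefficients (25) and (26) can be compared with a direct numerical evaluation of $E\{N(x; \mu_j, \sigma_j)\, h_p(x)\}$, which is what a least-squares fit weighted by the normal density requires when the polynomials are orthonormal. The sketch below uses the parameter values quoted for Figure 1 ($\mu_j = 0.5$, $\sigma_j = 2.0$); the function names and the trapezoidal integration settings are choices made here, not part of the original method.

```python
import math

def normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def normal_density(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def h(p, x):
    """Normalized Hermite-Tchebycheff polynomial He_p(x) / sqrt(p!)."""
    if p == 0:
        return 1.0
    he_prev, he = 1.0, x
    for k in range(1, p):
        he_prev, he = he, x * he - k * he_prev
    return he / math.sqrt(math.factorial(p))

def expect(func, lo=-10.0, hi=10.0, steps=4000):
    """E{func(x)} for standard normal x, by the trapezoidal rule."""
    width = (hi - lo) / steps
    total = 0.0
    for i in range(steps + 1):
        x = lo + i * width
        weight = 0.5 if i in (0, steps) else 1.0
        total += weight * func(x) * normal_density(x)
    return total * width

def coefficient(p, mu, sigma):
    """Closed forms (25) and (26) for b_jp."""
    alpha = math.sqrt(1.0 + sigma * sigma)
    u = mu / alpha
    if p == 0:
        return normal_cdf(-u)
    return h(p - 1, u) * normal_density(u) / (math.sqrt(p) * alpha ** p)

def coefficient_numeric(p, mu, sigma):
    """Direct projection of the ogive on h_p under the normal density."""
    return expect(lambda x: normal_cdf((x - mu) / sigma) * h(p, x))

mu, sigma = 0.5, 2.0
closed = [coefficient(p, mu, sigma) for p in range(4)]
numeric = [coefficient_numeric(p, mu, sigma) for p in range(4)]
for p in range(4):
    print(p, round(closed[p], 5), round(numeric[p], 5))
```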

Recent developments in the analysis of covariance structures 
(McDonald, 1978b; 1980) have made possible a more direct application 
of this theory, and the rest of this paper will focus upon the new method. 

Because the representation (24) of the normal ogive is a linear
combination of (random) orthogonal functions of the random variable $x$, it
follows by a weak implication of the principle of local independence that

$$p_j = P\{y_j = 1\} = E\{y_j\} = E\{y_j^2\} = b_{j0} \tag{30}$$

$$p_{jk} = P\{y_j = 1, y_k = 1\} = E\{y_j y_k\} = \sum_{p=0}^{\infty} b_{jp} b_{kp}, \quad j \neq k, \tag{31}$$

where $b_{j0}$ and $b_{jp}$ are the functions of $\mu_j$ and $\sigma_j$
given by (25) and (26). We can rewrite (30) and (31) in a familiar matrix
form by defining $\mathbf{y}' = [1, y_1, \ldots, y_n]$, an
$(n + 1)$-component vector whose first component is unity,
$\mathbf{P}^* = E\{\mathbf{y}\mathbf{y}'\}$,
$\mathbf{b}' = [b_{10}, \ldots, b_{n0}]$,
$\mathbf{B} = [b_{jp}]$ ($p = 1, 2, \ldots$),
$\mathbf{p}' = [p_1, \ldots, p_n]$, and $\mathbf{P} = [p_{jk}]$. Equations
(30) and (31) can then be expressed in the form

$$\mathbf{P}^* = \begin{bmatrix} 1 & \mathbf{p}' \\ \mathbf{p} & \mathbf{P} \end{bmatrix} = \begin{bmatrix} 1 & \mathbf{0}' \\ \mathbf{b} & \mathbf{B} \end{bmatrix} \begin{bmatrix} 1 & \mathbf{b}' \\ \mathbf{0} & \mathbf{B}' \end{bmatrix} + \begin{bmatrix} 0 & \mathbf{0}' \\ \mathbf{0} & \mathbf{U}^2 \end{bmatrix}, \tag{32}$$

where $\mathbf{U}^2$ is the diagonal matrix whose $j$th diagonal element is

$$u_j^2 = b_{j0} - \sum_{p=0}^{\infty} b_{jp}^2. \tag{33}$$

The right member of (32) is formally the same as the structure implied by
the orthogonal common factor model. In this case the column-order of
$\mathbf{B}$ (the number of 'common factors') is infinite, but the
infinitely many elements of $\mathbf{B}$ are all functions of the $2n$
parameters $\mu_j$, $\sigma_j$, $j = 1, \ldots, n$. That is, we have
expressed the normal ogive model, which is nonlinear in its parameters and
in the latent trait $x$, as a linear combination of infinitely many
nonlinear functions of the latent trait, with coefficients that are
nonlinear functions of the parameters of the model. Consequently the normal
ogive model becomes a special case of the common factor model.
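
To make the structure (32) concrete, the sketch below assembles a truncated version of the moment matrix from item parameters, using the closed forms (25) and (26) up to the cubic term. The item means and the common standard deviation are those described for Example 1 of this paper (ten items, $\sigma = 1.33$); the function names, and the device of defining each residual term so that the diagonal reproduces $p_j$, are choices made for this illustration.

```python
import math

def normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def normal_density(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def h(p, x):
    """Normalized Hermite-Tchebycheff polynomial He_p(x) / sqrt(p!)."""
    if p == 0:
        return 1.0
    he_prev, he = 1.0, x
    for k in range(1, p):
        he_prev, he = he, x * he - k * he_prev
    return he / math.sqrt(math.factorial(p))

def coefficients(mu, sigma, degree=3):
    """[b_j0, ..., b_j,degree] from (25) and (26)."""
    alpha = math.sqrt(1.0 + sigma * sigma)
    u = mu / alpha
    b = [normal_cdf(-u)]
    for p in range(1, degree + 1):
        b.append(h(p - 1, u) * normal_density(u) / (math.sqrt(p) * alpha ** p))
    return b

def implied_moments(mus, sigmas, degree=3):
    """Return (P*, u2): the truncated matrix of (32) and residual terms."""
    coef = [coefficients(m, s, degree) for m, s in zip(mus, sigmas)]
    n = len(coef)
    u2 = [c[0] - sum(v * v for v in c) for c in coef]    # truncated (33)
    P = [[0.0] * (n + 1) for _ in range(n + 1)]
    P[0][0] = 1.0
    for j in range(n):
        P[0][j + 1] = P[j + 1][0] = coef[j][0]           # p_j = b_j0
        for k in range(n):
            common = sum(a * c for a, c in zip(coef[j], coef[k]))
            P[j + 1][k + 1] = common + (u2[j] if j == k else 0.0)
    return P, u2

mus = [-1.0, -0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8]
sigmas = [1.33] * 10
P_star, u2 = implied_moments(mus, sigmas)
print(round(P_star[1][0], 3), round(P_star[2][1], 3))
```

The two printed population values can be set beside the corresponding sample proportions in the raw product moment table for $N = 3000$ (0.722 and 0.544).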

McDonald (1978b) has described a model for the analysis of covariance
structures which allows higher order factor analysis of any order, with
residual matrices of any prescribed structure. For the present
application, the important property of the model is that the user can
impose constraints on the matrices in it by making each element of each
matrix a prescribed function of one or more 'fundamental' parameters (the
parameters, that is, with respect to which the model is actually fitted).
A program COSAN has been written for the model, and some applications are
described in McDonald (1980). Program COSAN minimizes one of several loss
functions with respect to the parameters of a given model, using a
quasi-Newton method. For many purposes, the constraints on the model
consist in setting certain elements of a factor loading matrix, or a
residual or correlation matrix, equal to a constant (usually zero for
simple structure orthogonality, and unity for a self-correlation) or
constraining two or more elements to be equal, as in the work of Jöreskog
(1970). In addition to these standard provisions of program COSAN, the
user can write sub-routines of his own (usually very short and simple) if
he wishes to prescribe special constraints upon the elements of the
matrices in the model. It is therefore very easy to use program COSAN to
fit the parameters of the normal ogive model by fitting the version of the
orthogonal factor model in (32) to a sample counterpart of $\mathbf{P}^*$,
where the elements of $\mathbf{b}$ and $\mathbf{B}$ are the prescribed
functions of $\mu_j$ and $\sigma_j$, $j = 1, \ldots, n$. In practice, of
course, we must truncate the matrix $\mathbf{B}$ to be of some finite
column-order, and therefore we are in a sense fitting an approximate
version of the normal ogive model. However, the coefficients $b_{jp}$
rapidly diminish as $p$ increases, and trial suggests both that terms
beyond the cubic are negligible in magnitude and that including them would
not improve the precision of estimation of the fundamental parameters of
the model at all, even if it were to improve the fit slightly.
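
The claim that the coefficients die away quickly can be examined for any particular item. The sketch below, a rough check only, evaluates (26) for $p = 1, \ldots, 8$ for one hypothetical item ($\mu_j = 0.5$, $\sigma_j = 1.33$) and compares the sum of squared coefficients beyond the cubic with the sum up to the cubic. Individual coefficients need not decrease monotonically (for instance $b_{j2}$ vanishes when $\mu_j = 0$), so the comparison is made on sums of squares.

```python
import math

def normal_density(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def h(p, x):
    """Normalized Hermite-Tchebycheff polynomial He_p(x) / sqrt(p!)."""
    if p == 0:
        return 1.0
    he_prev, he = 1.0, x
    for k in range(1, p):
        he_prev, he = he, x * he - k * he_prev
    return he / math.sqrt(math.factorial(p))

def coefficient(p, mu, sigma):
    """b_jp from (26), for p >= 1."""
    alpha = math.sqrt(1.0 + sigma * sigma)
    u = mu / alpha
    return h(p - 1, u) * normal_density(u) / (math.sqrt(p) * alpha ** p)

mu, sigma = 0.5, 1.33            # hypothetical item
b = {p: coefficient(p, mu, sigma) for p in range(1, 9)}

head = sum(b[p] ** 2 for p in (1, 2, 3))        # linear to cubic terms
tail = sum(b[p] ** 2 for p in range(4, 9))      # quartic and beyond (to p = 8)

for p in range(1, 9):
    print(p, round(b[p], 5))
print("tail / head:", tail / head)
```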

With $\mathbf{B}$ truncated to an $n \times 3$ matrix, we fit the model
(24) to a sample matrix

$$\mathbf{S}^* = \begin{bmatrix} 1 & \mathbf{s}' \\ \mathbf{s} & \mathbf{S} \end{bmatrix} \tag{34}$$

in which the $j$th component of $\mathbf{s}$, $\sum_i y_{ji}/N$, is the
proportion of examinees in the sample passing item $j$, and the $(j, k)$th
element of $\mathbf{S}$, $\sum_i y_{ji} y_{ki}/N$, is the proportion of
examinees passing items $j$ and $k$. We minimize the usual least-squares
function

$$\phi = \operatorname{tr}\{(\mathbf{P}^* - \mathbf{S}^*)^2\} \tag{35}$$

with respect to the $2n$ parameters $\mu_j$, $\sigma_j$. By fixing the
$\sigma_j$ values to be equal to a common value $\sigma$, we may fit an
equivalent of the Rasch model, minimizing (35) with respect to the $\mu_j$
and $\sigma$. If we fix $\sigma = 0$, we seek to fit the perfect scale,
estimating the $\mu_j$ only. (See McDonald, 1967a.) We can introduce a
guessing parameter by replacing the model with

$$E\{y_j \mid x = x_i\} = g_j + (1 - g_j)\, N(x_i; \mu_j, \sigma_j). \tag{36}$$

Correspondingly, (25) and (26) become

$$b_{j0} = g_j + (1 - g_j)\, N(-\mu_j/\alpha_j) \tag{37}$$

and

$$b_{jp} = (1 - g_j)\, p^{-1/2} \alpha_j^{-p}\, h_{p-1}(\mu_j/\alpha_j)\, n(\mu_j/\alpha_j), \quad p = 1, 2, \ldots \tag{38}$$

We could then read in guessing parameters, possibly as the reciprocal of
the number of options in multiple-choice items, or estimate them, in
combination with any of the options for the $\sigma_j$ values. To apply
COSAN to these purposes, one library sub-routine of seven executable
statements, to evaluate the normal density and distribution functions,
and two special sub-routines, of simple logical structure, of 14 and 68
executable statements are needed. A program has also been written to
generate normal ogive data on which to test the method. A program
MESAMAX, the OISE version of a program by B. Wright and N. Panchapakesan
for fitting the Rasch model, was the only program available
Table 1  Example 1: True $\sigma$ = 1.33

             |------------- 10$\sigma$ -------------|   1$\sigma$       1$\sigma$
              estimated    estimated    estimated    estimated    estimated      RASCH
              cubic        cubic        cubic        cubic        straight       program
              fitted       fitted       fitted       fitted       line fitted
True $\mu_j$  N = 50 000   N = 3000     N = 500      N = 500      N = 500        N = 500

  -1.00       -1.000       -1.006       -0.976       -0.998       -0.989         -0.8997
  -0.80       -0.816       -0.803       -0.604       -0.746       -0.740         -0.5805
  -0.60       -0.593       -0.641       -0.538       -0.582       -0.578         -0.5352
  -0.40       -0.40?       -0.372       -0.424       -0.419       -0.416         -0.3576
  -0.20       [illegible]  -0.169       -0.173       -0.210       -0.208         -0.0509
   0.00        0.012        0.032        0.045        0.049        0.049          0.0250
   0.20        0.198        0.164        0.366        0.355        0.354          0.3098
   0.40        0.39?        0.438        0.407        0.396        0.395          0.4718
   0.60        0.582        0.655        0.587        0.591        0.588          0.7010
   0.80        0.809        0.820        0.841        0.828        0.825          0.9162

min $\sigma$  [illegible]   1.259        1.032
max $\sigma$  [illegible]   1.490        1.578
$\sigma$                                              1.318        1.297

for comparison with conventional methods for fitting latent trait models
that gave believable results. This sets limits upon comparisons that can
be made in two-parameter cases.

A large number of constructed examples have been run. Of these,
three will be described, and other observations of the behaviour of the
method will be briefly summarized.
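
The ingredients of such a constructed example can be sketched as follows: generate binary responses from a normal ogive model, form the raw product moment matrix of (34), and evaluate the least-squares function (35) for a truncated cubic model. This is an illustration under stated assumptions, not the COSAN program: the sample size of 2000, the random seed, the function names, and the comparison against a deliberately shifted set of item means are all choices made here, and no minimization is performed.

```python
import math
import random

def normal_cdf(t):
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def normal_density(t):
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def h(p, x):
    """Normalized Hermite-Tchebycheff polynomial He_p(x) / sqrt(p!)."""
    if p == 0:
        return 1.0
    he_prev, he = 1.0, x
    for k in range(1, p):
        he_prev, he = he, x * he - k * he_prev
    return he / math.sqrt(math.factorial(p))

def coefficients(mu, sigma, degree=3):
    """[b_j0, ..., b_j,degree] from (25) and (26)."""
    alpha = math.sqrt(1.0 + sigma * sigma)
    u = mu / alpha
    b = [normal_cdf(-u)]
    for p in range(1, degree + 1):
        b.append(h(p - 1, u) * normal_density(u) / (math.sqrt(p) * alpha ** p))
    return b

def model_matrix(mus, sigmas, degree=3):
    """Truncated P* of (32); the diagonal equals p_j because y_j^2 = y_j."""
    coef = [coefficients(m, s, degree) for m, s in zip(mus, sigmas)]
    n = len(coef)
    P = [[0.0] * (n + 1) for _ in range(n + 1)]
    P[0][0] = 1.0
    for j in range(n):
        P[0][j + 1] = P[j + 1][0] = coef[j][0]
        for k in range(n):
            if j == k:
                P[j + 1][k + 1] = coef[j][0]
            else:
                P[j + 1][k + 1] = sum(a * c for a, c in zip(coef[j], coef[k]))
    return P

def simulate(mus, sigmas, n_examinees, rng):
    """Binary responses with P(pass | x) = N((x - mu_j) / sigma_j)."""
    data = []
    for _ in range(n_examinees):
        x = rng.gauss(0.0, 1.0)
        data.append([1 if x - m - s * rng.gauss(0.0, 1.0) > 0.0 else 0
                     for m, s in zip(mus, sigmas)])
    return data

def sample_matrix(data):
    """Raw product moment matrix S* of (34), including the unit variable."""
    n, count = len(data[0]), len(data)
    S = [[0.0] * (n + 1) for _ in range(n + 1)]
    for row in data:
        y = [1] + row
        for j in range(n + 1):
            if y[j]:
                for k in range(n + 1):
                    S[j][k] += y[k]
    return [[value / count for value in r] for r in S]

def loss(P, S):
    """tr{(P* - S*)^2}: for symmetric matrices, the sum of squared differences."""
    size = len(P)
    return sum((P[j][k] - S[j][k]) ** 2 for j in range(size) for k in range(size))

mus = [-1.0, -0.8, -0.6, -0.4, -0.2, 0.0, 0.2, 0.4, 0.6, 0.8]
sigmas = [1.33] * 10
rng = random.Random(1982)                      # arbitrary seed
S_star = sample_matrix(simulate(mus, sigmas, 2000, rng))

loss_true = loss(model_matrix(mus, sigmas), S_star)
loss_shifted = loss(model_matrix([m + 0.5 for m in mus], sigmas), S_star)
print(loss_true, loss_shifted)
```

A fitting program would search over the $\mu_j$ and $\sigma_j$ for the minimum of this function; here the function is merely evaluated at the generating values and at a displaced set to show that it discriminates between them.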

Example 1: Three data-sets were generated, with sample sizes 50 000,
3000, and 500 respectively, whose true $\mu_j$ values are listed in the
first column of Table 1. A common value of $\sigma = 1.33$ was employed for
all items to enable a reasonable comparison with the available program for
fitting the Rasch model. Each of the three resulting 11 x 11 raw product
moment matrices ($\mathbf{S}^*$ in (34)) was analysed four times by COSAN:
(a) fitting ten $\mu_j$ values and a common $\sigma$ value versus fitting
ten $\mu_j$ values and ten $\sigma_j$ values; (b) using the cubic
approximation to the normal ogive model versus using the linear
approximation, that is, deleting the columns of $\mathbf{B}$ containing
coefficients $b_{j2}$, $b_{j3}$. Table 1 gives the estimates of the
parameters for five of these analyses, as well as the estimates obtained
by program MESAMAX applied to the raw data for sample size 500 and
transformed for compatibility with the normal ogive representation.

These results illustrate observations that have been made from a wider
range of analyses. The estimates of $\mu_j$ are not noticeably more precise
when one $\sigma$ is fitted than when they are fitted individually. That is, it is no
Table 2  Example 1: Raw Product Moments (N = 3000)

         1      2      3      4      5      6      7      8      9     10     11

 1   1.000
 2   0.722  0.722
 3   0.689  0.544  0.689
 4   0.651  0.510  0.502  0.651
 5   0.589  0.482  0.458  0.439  0.589
 6   0.541  0.435  0.420  0.407  0.376  0.541
 7   0.493  0.403  0.393  0.375  0.344  0.320  0.493
 8   0.460  0.379  0.369  0.355  0.330  0.311  0.286  0.460
 9   0.402  0.333  0.327  0.309  0.295  0.266  0.254  0.240  0.402
10   0.355  0.297  0.292  0.281  0.266  0.247  0.224  0.214  0.184  0.355
11   0.320  0.267  0.264  0.248  0.232  0.221  0.211  0.201  0.175  0.157  0.320
Table 3  Example 1: Residuals, Cubic Approximation

          1       2       3       4       5       6       7       8       9      10      11

 1    0.000
 2    0.000   0.000
 3    0.000   0.001   0.000
 4    0.001  -0.005   0.003  -0.000
 5   -0.002   0.007  -0.003  -0.000   0.000
 6    0.001  -0.001  -0.004   0.003  -0.000  -0.000
 7    0.000   0.001   0.001   0.001  -0.005  -0.003  -0.000
 8    0.001  -0.001  -0.002   0.000  -0.001   0.004  -0.000  -0.000
 9   -0.001   0.000   0.002  -0.001   0.005  -0.003   0.003   0.000  -0.000
10    0.002  -0.000   0.001   0.003   0.005   0.005  -0.002  -0.002  -0.007   0.000
11    0.002  -0.001   0.001  -0.004  -0.005   0.001   0.005   0.003   0.001  -0.002  -0.000


Table 4  Example 1: Residuals, Linear Approximation

           1       2       3       4       5       6       7       8       9      10      11
  1    0.000
  2   -0.001   0.000
  3   -0.001   0.002   0.000
  4   -0.000   0.004   0.004   0.000
  5   -0.002   0.008   0.002  -0.000   0.000
  6    0.001  -0.001   0.004   0.003  -0.000   0.000
  7    0.001   0.000   0.001   0.001  -0.005  -0.003   0.000
  8    0.001   0.002  -0.003   0.000  -0.001   0.003  -0.000  -0.000
  9   -0.000  -0.001   0.002  -0.002   0.004  -0.004   0.003   0.000   0.000
 10   -0.001  -0.001   0.000   0.002   0.004   0.005  -0.002  -0.002  -0.006  -0.000
 11    0.002  -0.002   0.000  -0.005  -0.006   0.001   0.005   0.004   0.002  -0.000   0.000

Table 5  Example 1: Coefficients of Cubic

 Item    b_j0     b_j1     b_j2     b_j3
   1    0.722    0.196   -0.048   -0.018
   2    0.689    0.217   -0.047   -0.025
   3    0.650    0.223   -0.037   -0.028
   4    0.591    0.239   -0.024   -0.035
   5    0.540    0.235   -0.010   -0.033
   6    0.492    0.238    0.002   -0.034
   7    0.459    0.247    0.011   -0.039
   8    0.403    0.216    0.021   -0.026
   9    0.357    0.208    0.030   -0.023
  10    0.318    0.205    0.040   -0.022

more difficult to fit the two-parameter model than to fit the
one-parameter model, using this method. The estimates are not noticeably
more precise when the cubic approximation is used than when the linear
approximation is employed. Table 1 gives the minimum and maximum
values of $\sigma_j$ when these are estimated as ten distinct parameters,
and the value of $\sigma$ when a common parameter is estimated.

Table 2 gives the 11x11 raw product-moment matrix for sample size
3000. Table 3 gives the corresponding residual matrix using the cubic
approximation, while Table 4 gives the residual matrix using the linear
approximation (with ten $\sigma_j$ values estimated in both cases). Table 5
gives the estimated values of the coefficients $b_{j0}$ in b and $b_{j1}$,
$b_{j2}$, $b_{j3}$ in B, corresponding to Table 3. (These are estimated
parametric functions of $\mu_j$ and $\sigma_j$.) Since they are coefficients
of orthonormal polynomials, they behave like loadings on orthogonal common
factors, showing directly that the data are actually accounted for to a
good approximation by a linear model. The approximation is only slightly
improved by the addition of the quadratic and cubic terms. Terms beyond
the cubic would almost certainly be quite negligible. At the same time,
fitting the cubic approximation can be recommended on the basis of a
general observation, not illustrated by the comparison of Table 3 and
Table 4, that usually the residuals from the cubic approximation are just
sufficiently smaller to constitute slightly better evidence that the data
are unidimensional and adequately described by the normal ogive model.
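
The construction of a raw product-moment matrix of the kind shown in
Table 2 can be sketched in Python. This is an illustration only, not the
COSAN or MESAMAX procedure. The function names, the seed, and the item
parameters are hypothetical, and the simulation assumes the
parameterization implied by the Appendix, in which an item is passed when
$x + \sqrt{\sigma_j^2 - 1}\,e_j > \mu_j$ with $x$ and $e_j$ standard
normal, so that the proportion passing is $N(-\mu_j/\sigma_j)$.

```python
import numpy as np
from statistics import NormalDist


def simulate_items(mu, sigma, n_persons, rng):
    """Simulate binary item scores under an assumed normal ogive model."""
    mu = np.asarray(mu, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    x = rng.standard_normal(n_persons)                 # latent trait
    e = rng.standard_normal((n_persons, mu.size))      # item-specific noise
    latent = x[:, None] + np.sqrt(sigma**2 - 1.0) * e  # variance sigma_j**2
    return (latent > mu).astype(float)


def raw_product_moments(y):
    """Raw product-moment matrix of the scores augmented by a constant 1."""
    z = np.column_stack([np.ones(y.shape[0]), y])
    return z.T @ z / y.shape[0]


rng = np.random.default_rng(0)        # arbitrary seed
mu = np.linspace(-1.0, 0.8, 10)       # hypothetical difficulties
sigma = np.full(10, 1.7)              # hypothetical common sigma
y = simulate_items(mu, sigma, 3000, rng)
S = raw_product_moments(y)            # 11 x 11, first row/column = item means
expected_means = np.array(
    [NormalDist().cdf(-m / s) for m, s in zip(mu, sigma)]
)
```

As in Table 2, the first row and column carry the item means, and each
diagonal entry equals the corresponding mean because $y_j^2 = y_j$ for
binary scores.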

Example 2: A data-set, of sample size 3000, was generated, to consist
of 50 items, combinations of five $\sigma_j$ values and ten $\mu_j$ values,
as shown in the margins of Tables 6 and 7. These contain the estimates by
COSAN (using the cubic approximation) of the $\mu_j$ values and the
$\sigma_j$ values respectively. Inspection suggests that the precision of
the estimates of the $\mu_j$ values is not noticeably affected by the size
of $\mu_j$ itself, or the size of $\sigma_j$, and that the precision of the
estimates of the $\sigma_j$ values, while approximately proportional to
$\sigma_j$, is not noticeably affected by the size of $\mu_j$.

Example 3: Again with a sample size of 3000, and with a common $\sigma$
value of 1.7, 20 binary items were simulated in ten pairs, with $\mu_j$
values repeated, as in the previous example, but with the parameter $g_j$
in (36) introduced and set to 0.2 for the first member of each pair, and
0.5 for the second member. It is as though each odd-numbered item is a
multiple-choice item with five options, while the following even-numbered
item is an otherwise equivalent true/false item. The first column of
Table 8 gives the true $\mu_j$ values. The second contains the estimates by
COSAN, with the cubic approximation, and reading in $g_j$ values
alternately of 0.2 and 0.5 and holding them fixed, as we might do from
knowledge of the item formats. The third column contains COSAN estimates
assuming that there is no effect of guessing on the data, that is, setting
each $g_j$ value to zero. The
Table 6  Example 2: $\mu_j$ Estimates (N = 3000)

                          True $\sigma_j$
 True $\mu_j$    2.27     2.00     1.72     1.52     1.33
   -1.00       -1.042   -0.997        ?        ?        ?
   -0.80       -0.800   -0.865   -0.756   -0.756   -0.757
   -0.60       -0.584   -0.52?   -0.529   -0.595   -0.541
   -0.40       -0.456   -0.358   -0.398   -0.366   -0.381
   -0.20       -0.316   -0.236   -0.207   -0.277   -0.245
    0.00        0.050    0.040   -0.038    0.013   -0.040
    0.20            ?        ?        ?    0.239    0.252
    0.40            ?    0.4?0    0.40?    0.319    0.484
    0.60            ?        ?        ?    0.593    0.605
    0.80            ?        ?        ?    0.879    0.817

Note: ? marks an entry, or a digit, that is illegible in the source.

Table 7  Example 2: $\sigma_j$ Estimates (N = 3000)

                          True $\sigma_j$
 True $\mu_j$    2.27     2.00     1.72     1.52     1.33
   -1.00        2.2?7    2.062    1.768    1.598    1.115
   -0.80        2.2?9    2.057    1.680    1.569    1.?48
   -0.60        2.??2    2.059    1.606    1.471    1.316
   -0.40        2.2?1    1.?45    1.818    1.585    1.121
   -0.20        2.504    2.050    1.687    1.747    1.348
    0.00        2.620        ?    1.995    1.452    1.??8
    0.20        2.12?    2.141    1.617    1.579    1.458
    0.40        2.0??    1.990    1.954    1.5?4    1.502
    0.60        2.241    2.405    1.668    1.589    1.1?9
    0.80        2.419    1.928    1.79?    1.652    1.5?9

Note: ? marks an entry, or a digit, that is illegible in the source.

fourth column contains the estimates of the $\mu_j$ values obtained from
MESAMAX, which of course makes no provision for guessing. It is clear
that the effects of guessing in multiple-choice items must be allowed for
in the analysis. The method of Wright and Panchapakesan yields quite
unacceptable estimates of the difficulty parameters in the presence of
guessing, as does the COSAN method when the guessing parameters are
assumed to be zero. When the guessing parameters are treated as known
(as in Lord, 1968), the COSAN method gives good estimates of the other
parameters. Attempts to use COSAN to estimate guessing parameters, as
well as the other parameters of the model, have run into difficulties
requiring further research. It seems likely that the nonlinear factor
model (17) will have to be fitted directly to raw data if the present
method is to yield estimates of the three-parameter model.
Table 8  Example 3: Estimates

               COSAN estimate     COSAN estimate
               with $g_j$ fixed   with                RASCH
 True $\mu_j$  at 0.2 / 0.5       $g_j = 0$           estimate
   -1.00          -1.073            -1.962            -0.392
   -1.00          -0.934            -2.648            -0.889
   -0.80          -0.887            -1.734            -0.220
   -0.80          -0.768            -2.480            -0.771
   -0.60          -0.593            -1.380             0.016
   -0.60          -0.602            -2.315            -0.648
   -0.40          -0.518            -1.291             0.085
   -0.40          -0.434            -2.153            -0.538
   -0.20          -0.208            -0.927             0.332
   -0.20          -0.196            -1.932            -0.388
    0.00          -0.008            -0.697             0.481
    0.00           0.012            -1.745            -0.267
    0.20           0.160            -0.507             0.606
    0.20           0.137            -1.637            -0.203
    0.40           0.431            -0.209             0.802
    0.40           0.2?2            -1.565            -0.143
    0.60           0.557            -0.073             0.885
    0.60           0.585            -1.273             0.034
    0.80           0.800             0.180             1.052
    0.80           0.853            -1.075             0.178

Note: ? marks a digit that is illegible in the source.

While a more systematic Monte Carlo study is desirable, the examples
serve to show that we can indeed fit the normal ogive model, in a one-,
two-, or three-parameter version (the latter with known guessing
parameters), by a program for the analysis of covariance structures, with
reasonably satisfactory results. The fact that we can do this at all
illustrates an essential unity of psychometric theory for 'quantitative'
tests and 'qualitative' items that we might easily lose sight of while
concentrating on the details of fitting the models by the conventional
statistical procedures. The fact that we can do this reasonably well
suggests that the technique deserves further exploration, as possibly a
useful one at least for some data sets. Already it is clear that
reasonably precise estimates can be obtained over a range of sample sizes,
from about the smallest we should ever use for such work to indefinitely
large. The obvious advantage of the method is that it supplies a measure
of the goodness of fit of the model in the familiar form of a residual
matrix, and the sum of squares of its elements, and it does not require a
prior examination of the dimensionality of the data. An obvious limitation
of the method is its assumption that the latent trait has a normal
distribution. There seems to be less willingness on the part of
investigators to assume normality of the latent trait underlying a set of
items than to make the same assumption for the common factor underlying a
set of tests. This would be because it is easier to exert control over the
distribution in the former case than in the latter, by the design of the
items. (We can skew the distribution of a factor by choosing tests that
are too difficult or too easy, or flatten it by choosing a wide range of
test difficulties. Since tests are item sums, the cases cannot be very
different.) If the method suggested seems worth using, the user can in
principle design his item set to have a close-to-normal distribution of
the latent trait. Since users may not want to do this, further research
will include an investigation of the robustness of the method against
violations of the normality assumption. If it is not sufficiently robust
for general use, it will become worthwhile to repeat the theoretical work
leading to equations (25) and (26), using a more general distribution of
the latent trait. Because this work had been done originally for the
normal ogive model, discussion in this paper has been confined to that
case, just to save the rethinking that would be necessary even to state
the corresponding results for the equivalent logistic model.



REFERENCES

Anderson, T. W. Some scaling models and estimation procedures in the latent
class model. In U. Grenander (Ed.), Probability and statistics. (The Harald
Cramér Volume). New York: Wiley, 1959, 9-38.

Bock, R. D. and Lieberman, M. Fitting a response model for n dichotomously
scored items. Psychometrika, 1970, 35, 179-97.

Cronbach, L. J., Gleser, G. C., Nanda, H., and Rajaratnam, N. The
dependability of behavioral measurements: Theory of generalizability for
scores and profiles. New York: Wiley, 1972.

Finney, D. J. Probit analysis. Cambridge: Cambridge University Press, 1952.

Guttman, L. Chapters 2, 3, 6, 8, 9. In S. A. Stouffer et al., Measurement
and prediction. Princeton, NJ: Princeton University Press, 1950.

Hambleton, R. K., Swaminathan, H., Cook, L. L., Eignor, D. R., and Gifford,
J. A. Developments in latent trait theory: Models, technical issues, and
applications. Review of Educational Research, 1978, 48, 467-510.

Jöreskog, K. G. A general method for analysis of covariance structures.
Biometrika, 1970, 57, 239-51.

Kolakowski, D. and Bock, R. D. A Fortran-IV program for maximum likelihood
item analysis and test scoring: Normal ogive model. (Research Memorandum
No. 12, 1970). Chicago: University of Chicago, Department of Education,
Statistical Laboratory, 1970.

Lawley, D. N. Further investigations in factor estimation. Proceedings of
the Royal Society of Edinburgh, 1942, 62, 176-85.

Lawley, D. N. On problems connected with item selection and test
construction. Proceedings of the Royal Society of Edinburgh, 1943, 61,
273-87.

Lazarsfeld, P. F. Chapters 10, 11. In S. A. Stouffer et al., Measurement
and prediction. Princeton, NJ: Princeton University Press, 1950.

Lord, F. M. A theory of test scores. Psychometric Monographs, 1952, 7.

Lord, F. M. An analysis of the verbal scholastic aptitude test using
Birnbaum's three-parameter logistic model. Educational and Psychological
Measurement, 1968, 28, 989-1020.

Lord, F. M. and Novick, M. R. Statistical theories of mental test scores.
Reading, Mass.: Addison-Wesley, 1968.

McDonald, R. P. A note on the derivation of the general latent class model.
Psychometrika, 1962, 27, 203-6.

McDonald, R. P. Difficulty factors and nonlinear factor analysis. British
Journal of Mathematical and Statistical Psychology, 1965, 18, 11-23.

McDonald, R. P. Nonlinear factor analysis. Psychometric Monographs, 1967,
15. (a)

McDonald, R. P. Numerical methods for polynomial models in nonlinear factor
analysis. Psychometrika, 1967, 32, 77-112. (b)

McDonald, R. P. Generalizability in factorable domains: Domain validity and
generalizability. Educational and Psychological Measurement, 1978, 38,
75-9. (a)

McDonald, R. P. A simple comprehensive model for the analysis of covariance
structures. British Journal of Mathematical and Statistical Psychology,
1978, 31, 59-72. (b)

McDonald, R. P. The simultaneous estimation of factor loadings and scores.
British Journal of Mathematical and Statistical Psychology, 1979, 32,
212-28.

McDonald, R. P. A simple comprehensive model for the analysis of covariance
structures: Some remarks on applications. British Journal of Mathematical
and Statistical Psychology, 1980, 33, 161-83.

McDonald, R. P. The dimensionality of tests and items. British Journal of
Mathematical and Statistical Psychology, 1981, 34, 100-17.

McDonald, R. P. and Ahlawat, K. S. Difficulty factors in binary data.
British Journal of Mathematical and Statistical Psychology, 1974, 27,
82-99.

McDonald, R. P. and Burr, E. J. A comparison of four methods of
constructing factor scores. Psychometrika, 1967, 32, 381-401.

Torgerson, W. S. Theory and methods of scaling. New York: Wiley, 1958.

Wingersky, Marilyn S. and Lord, F. M. A computer program for estimating
examinee ability and item characteristic curve parameters when there are
omitted responses. (RM-73-2). Princeton, NJ: Educational Testing Service,
1973.

Wright, B. Solving measurement problems with the Rasch model. Journal of
Educational Measurement, 1977, 14, 97-116.

Wright, B. and Mead, R. J. BICAL: Calibrating items and scales with a Rasch
measurement model. (Research Memorandum No. 13). Chicago: University of
Chicago, Department of Education, Statistical Laboratory, 1977.

Wright, B. and Panchapakesan, N. A procedure for sample-free item analysis.
Educational and Psychological Measurement, 1969, 29, 23-48.

ACKNOWLEDGMENT

The author would like to thank C. Fraser for the programming and numerical
work in this research.






APPENDIX

Scoring an examinee

A preliminary investigation has not revealed any advantages of the
polynomial approximation over the conventional treatment when it comes
to scoring an examinee, i.e. obtaining an estimate of his latent
trait given his item scores. However, for completeness, some remarks can
be made about interesting parallelisms between equations for the
estimation of a latent trait from binary data and equations for the
estimation of a common factor from test scores. The one possibly useful
result to emerge so far from the examination of these parallelisms is an
expression that contains an estimate in closed form of the latent trait,
based upon the linear approximation. Whether it is close enough to the
conventional estimate will need to be investigated by a Monte Carlo study.

In the usual treatment of the common factor model, with random
common factors, we first fit the parameters of the model (factor loadings
and uniquenesses) to the covariance matrix of a 'calibration' sample. For
any examinee in the population that the model purports to describe, we
estimate his factor scores as linear combinations of his test scores. In
the Spearman case (1), the Weighted Least Squares (WLS) formula of
Bartlett,

\[ \hat{x} = \Bigl[\sum_{j=1}^{n} \frac{f_j^2}{\operatorname{Var}\{e_j\}}\Bigr]^{-1} \sum_{j=1}^{n} \frac{f_j}{\operatorname{Var}\{e_j\}}\,(y_j - m_j), \tag{A1} \]

minimizes the sum of squares of the given examinee's n residuals

\[ e_j = y_j - m_j - f_j x, \tag{A2} \]

weighted by the reciprocal of the variances of these residuals in the
population. That is, it minimizes

\[ \phi_1 = \sum_{j=1}^{n} \frac{e_j^2}{\operatorname{Var}\{e_j\}}. \tag{A3} \]

If we assume that each residual has a normal distribution, then (A1) is
also the maximum likelihood estimator of $x$. The corresponding
Unweighted Least Squares (ULS) estimator

\[ \hat{x} = \Bigl[\sum_{j=1}^{n} f_j^2\Bigr]^{-1} \sum_{j=1}^{n} f_j\,(y_j - m_j), \tag{A4} \]

minimizes the sum of the squares of the given examinee's residuals
without taking account of the residual variances. That is, it minimizes

\[ \phi_2 = \sum_{j=1}^{n} e_j^2 = \sum_{j=1}^{n} (y_j - m_j - f_j x)^2. \tag{A5} \]

(Scoring formulae (A1) and (A4) are respectively Method 2 and Method
1 as discussed by McDonald and Burr, 1967.)
The condition for a minimum of $\phi_1$ in (A3) may be written as

\[ \sum_{j=1}^{n} \frac{f_j}{\operatorname{Var}\{e_j\}}\,(y_j - m_j - f_j x) = 0, \tag{A6} \]

which on rearrangement gives (A1).
Writing (1) as

\[ y_j = E\{y_j \mid x\} + e_j, \tag{A7} \]

whence

\[ \frac{d}{dx} E\{y_j \mid x\} = f_j, \tag{A8} \]

and noting that

\[ \operatorname{Var}\{e_j\} = \operatorname{Var}\{y_j \mid x\}, \tag{A9} \]

we can rewrite (A6) as

\[ \sum_{j=1}^{n} \frac{\frac{d}{dx} E\{y_j \mid x\}}{\operatorname{Var}\{y_j \mid x\}}\,\bigl(y_j - E\{y_j \mid x\}\bigr) = 0, \tag{A10} \]

or as

\[ \sum_{j=1}^{n} w_j \bigl(y_j - E\{y_j \mid x\}\bigr) = 0, \tag{A11} \]

where

\[ w_j = \frac{d}{dx} E\{y_j \mid x\} \Big/ \operatorname{Var}\{y_j \mid x\}. \tag{A12} \]

That is, we choose $x$ such that the weighted sum of the residuals becomes
zero, with weights that consist of the slope of the regression of each
$y_j$ on $x$ divided by the conditional variance. In the linear common
factor model both the terms in the weight $w_j$ are independent of $x$, one
because the model is linear, and the other because (in the usual treatment
of the model) we assume that the residuals are homoscedastic.
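
The two scoring formulae for the linear case can be sketched in Python as
follows. This is a minimal illustration under stated assumptions: the
function names and the numerical values for four tests are hypothetical,
and the residual variances are passed in directly as `resid_var`.

```python
import numpy as np


def wls_score(y, m, f, resid_var):
    """Bartlett-type WLS estimate (A1): weights are f_j / Var{e_j}."""
    y, m, f, resid_var = (
        np.asarray(a, dtype=float) for a in (y, m, f, resid_var)
    )
    w = f / resid_var
    return float(np.sum(w * (y - m)) / np.sum(w * f))


def uls_score(y, m, f):
    """Unweighted least squares estimate (A4)."""
    y, m, f = (np.asarray(a, dtype=float) for a in (y, m, f))
    return float(np.sum(f * (y - m)) / np.sum(f * f))


# Hypothetical parameters for four tests and one examinee's scores.
m = np.array([10.0, 12.0, 8.0, 15.0])
f = np.array([2.0, 1.5, 1.0, 2.5])
resid_var = np.array([1.0, 2.0, 0.5, 4.0])
y = np.array([12.5, 12.4, 9.3, 16.0])

x_wls = wls_score(y, m, f, resid_var)
x_uls = uls_score(y, m, f)

# At the WLS estimate the weighted sum of residuals vanishes, as in (A11).
weighted_resid_sum = float(np.sum((f / resid_var) * (y - m - f * x_wls)))
```

Because the weights do not depend on $x$ in the linear model, both
estimates are available in closed form, and they coincide when the
residual variances are proportional to the loadings' common scale or when
the data fit the model exactly.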

Now suppose we have any latent trait model with a single latent trait $x$
and known parameters in the item characteristic curve $P_j(m_j + f_j x)$.
We write $Q_j = 1 - P_j$. We wish to estimate $x$ for a given examinee with
binary item scores $y_1, \ldots, y_n$. By well-known theory, the likelihood
equation is given by

\[ \sum_{j=1}^{n} \frac{dP_j/dx}{P_j Q_j}\,(y_j - P_j) = 0. \tag{A13} \]

Since, in the case of binary data,

\[ \operatorname{Var}\{y_j \mid x\} = P_j Q_j, \tag{A14} \]

equation (A13) is formally the same as equation (A10). That is, in any
latent trait model with a single latent trait and known item parameters, to
obtain the maximum likelihood estimate of an examinee's score we
choose a score such that the weighted sum of his residuals becomes zero,
if weighted by the slope of the regression of each $y_j$ on $x$ divided by
the conditional variance. In contrast to the linear case, however, both the
regression slope and the conditional variance are in general functions of
$x$.

In the special case where the n item characteristic curves $P_j(x)$ are
identical, equal to $P$, say, and $P$ has an inverse function $P^{-1}$, it
is well known that (A13) can be solved for $x$ in closed form. (See
Birnbaum in Lord and Novick, 1968, pp. 458-9.) This follows because we may
then write (A13) as

\[ \frac{dP/dx}{PQ} \sum_{j=1}^{n} (y_j - P) = 0, \tag{A15} \]

whence

\[ \sum_{j=1}^{n} y_j = nP(x), \tag{A16} \]

so that

\[ \hat{x} = P^{-1}\Bigl(\frac{1}{n} \sum_{j=1}^{n} y_j\Bigr). \tag{A17} \]
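
The closed-form solution (A17) can be sketched in Python for one concrete
choice of $P$. The sketch assumes, for illustration only, that the common
curve is the normal ogive $P(x) = N(m + fx)$; the function name and the
parameters `m` and `f` are hypothetical.

```python
from statistics import NormalDist


def score_identical_items(y, m, f):
    """Closed-form ML score (A17) when every item has ICC N(m + f*x)."""
    p_hat = sum(y) / len(y)          # proportion of items passed
    if p_hat == 0.0:
        return float("-inf")         # no finite ML estimate exists
    if p_hat == 1.0:
        return float("inf")
    # Invert P(x) = N(m + f*x) at the observed proportion.
    return (NormalDist().inv_cdf(p_hat) - m) / f
```

For an examinee who passes half of the items, the estimate is the point at
which $P(x) = 0.5$, that is, $x = -m/f$. A score of zero or n gives an
infinite estimate.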



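Again following Birnbaum we note that the logistic function $\Psi(t)$ in
(6) uniquely has the property that

\[ \frac{d\Psi(t)}{dt} = \Psi(t)\,[1 - \Psi(t)], \tag{A18} \]

whence it follows that

\[ \frac{\frac{d}{dx}\Psi[D(m_j + f_j x)]}{\Psi[D(m_j + f_j x)]\,\{1 - \Psi[D(m_j + f_j x)]\}} = D f_j, \tag{A19} \]

independent of $x$, so that in this case (A13) becomes

\[ \sum_{j=1}^{n} f_j\,\{y_j - \Psi[D(m_j + f_j x)]\} = 0, \tag{A20} \]

or

\[ \sum_{j=1}^{n} f_j\,\Psi[D(m_j + f_j x)] = \sum_{j=1}^{n} f_j y_j, \tag{A21} \]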




which is a way of expressing the fact that the weighted item sum

\[ \sum_{j=1}^{n} f_j y_j \]

is a sufficient statistic for $x$ for a given examinee in the two-parameter
logistic model if the values of the $f_j$ are known. In the special case of
the Rasch model, in which the coefficients $f_j$ have a common value, it
follows that the unit-weighted sum

\[ \sum_{j=1}^{n} y_j \]

is then a sufficient statistic for $x$. This latter property might be
considered useful in some practical applications. However, it is primarily
if we choose to fit the fixed regressors version of the model,
simultaneously estimating the item parameters and the latent traits in the
calibration sample, that this fact gives the one-parameter model an
advantage over those models that realistically allow for guessing and for
the fact that, in general, items measuring a given trait will not measure
it equally well.
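
The sufficiency of the weighted item sum can be illustrated with a short
Python sketch that solves the likelihood equation numerically. It is a
sketch under stated assumptions, not a production scoring routine: the
function names are hypothetical, the scaling constant of the logistic
function is taken as 1 (absorbed into $m_j$ and $f_j$), every $f_j$ is
assumed positive, and bisection on a fixed bracket is used in place of
whatever iterative method a scoring program would employ.

```python
import math


def logistic(t):
    """Standard logistic function."""
    return 1.0 / (1.0 + math.exp(-t))


def ml_score_2pl(y, m, f, lo=-10.0, hi=10.0, iters=200):
    """Solve sum_j f_j * [y_j - logistic(m_j + f_j * x)] = 0 by bisection."""
    target = sum(fj * yj for fj, yj in zip(f, y))   # the weighted item sum

    def g(x):
        fitted = sum(fj * logistic(mj + fj * x) for mj, fj in zip(m, f))
        return target - fitted

    # g is strictly decreasing in x when every f_j is positive, so the
    # root (if it lies inside the bracket) can be found by bisection.
    for _ in range(iters):
        mid = 0.5 * (lo + hi)
        if g(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)


# Two different response patterns with the same weighted sum (= 2).
m = [0.0, 0.0, 0.0]
f = [1.0, 1.0, 2.0]
x_a = ml_score_2pl([1, 1, 0], m, f)
x_b = ml_score_2pl([0, 0, 1], m, f)
```

The response pattern enters the equation only through the weighted sum, so
the two patterns receive the same estimate. For all-pass or all-fail
patterns the root lies outside any finite bracket and the routine simply
returns a value at the edge of the bracket.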

If we wish to apply the Rasch model to the measurement of ability by
means of multiple-choice items, it would seem by Example 3 given earlier
that we must introduce a guessing parameter. If we do so, the likelihood
equation does not yield a counterpart of (A19), whether or not the $f_j$
values are supposed equal. That is, the attempt to apply the model to
multiple-choice items by the introduction of a guessing parameter
destroys the sufficiency that has been regarded as an important property
of the logistic model (and destroys other properties of the Rasch model
that have been regarded as its important special characteristics). Perhaps
more importantly, whether or not a single function of the examinee's
responses is a sufficient statistic for his $x$, we cannot in general solve
the likelihood equation in closed form.

The remarks in this appendix arose out of a tentative exploration of
the problem of scoring the examinee in terms of the polynomial
approximation model that was introduced in the body of the paper. The hope
was that the nice properties of the Spearman linear case, including closed
form, might carry over to this problem. Such a hope was quickly seen to
be unfounded. In particular, an attempt to substitute the polynomial
model in (A13), even just the linear approximation, yields seemingly
intractable expressions. One result from this exploration may be of value
and is therefore perhaps worth reporting.

If we accept the evidence given earlier that the linear model given by
the first two terms of the polynomial series (24) is in general a good
approximation to the normal ogive model, we first consider substituting
this in the condition (A13) for a WLS, i.e. ML, solution, to yield

\[ \sum_{j=1}^{n} \frac{b_{j1}}{(b_{j0} + b_{j1} x)(1 - b_{j0} - b_{j1} x)}\,(y_j - b_{j0} - b_{j1} x) = 0. \tag{A22} \]







It seems fairly obvious that the linear approximation does not yield a
solution in closed form. In desperation, with some, but not much,
theoretical justification, we consider instead applying directly the ULS
expression (A4), which yields

\[ \hat{x} = \Bigl[\sum_{j=1}^{n} b_{j1}^2\Bigr]^{-1} \sum_{j=1}^{n} b_{j1}\,(y_j - b_{j0}), \tag{A23} \]

that is, by (25) and (27)

\[ \hat{x} = \Bigl[\sum_{j=1}^{n} v_j^2\Bigr]^{-1} \sum_{j=1}^{n} v_j\,\bigl[y_j - N(-\mu_j/\sigma_j)\bigr], \tag{A24} \]

where

\[ v_j = \frac{1}{\sigma_j}\, n(-\mu_j/\sigma_j). \tag{A25} \]

This closed-form linear least squares estimate of $x$ is not without a
certain plausibility of expression. By theory given in Lord and Novick
(1968, pp. 377-8), the quantity $1/\sigma_j$ is the correlation between $x$
and $y_j$, and is a measure of the discriminating power of the item, while
the quantity $N(-\mu_j/\sigma_j)$ is the proportion of examinees passing
the item. In (A24), the contribution of an item to the estimate is weighted
in three reasonable ways. First, the item is weighted proportionally to its
discriminating power, $1/\sigma_j$. Second, the item is weighted by
$n(-\mu_j/\sigma_j)$, so that greater weight is given to items near the
mean of the distribution of $x$ than to items further away in either
direction. Third, if the item is difficult, it gives a larger absolute
value to the (positive) contribution from passing the item than to the
(negative) contribution from failing it, and conversely for an easy item.
It is conjectured from the form of (A24) that $\hat{x}$ will prove an
acceptable closed-form estimate of the examinee's latent trait, given a
reasonable number of items. The approximation involved should be best at
the middle of the distribution. At the extremes, corresponding to a total
test score of zero or n, where the maximum likelihood estimate is infinite,
$\hat{x}$ cannot be correct. But this is not necessarily a disadvantage.
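
The closed-form estimate discussed around (A24) can be sketched in Python.
The sketch rests on the description in the preceding paragraph rather than
on equations (25) and (27), which are not reproduced here: it assumes that
each item contributes its residual from the proportion passing,
$N(-\mu_j/\sigma_j)$, weighted by $n(-\mu_j/\sigma_j)/\sigma_j$, with $N$
and $n$ the standard normal distribution and density functions. The
function name and the example parameters are hypothetical.

```python
from statistics import NormalDist

_STD_NORMAL = NormalDist()


def linear_score(y, mu, sigma):
    """Closed-form ULS score from the linear approximation to the ogive."""
    num = 0.0
    den = 0.0
    for yj, mj, sj in zip(y, mu, sigma):
        p_pass = _STD_NORMAL.cdf(-mj / sj)       # proportion passing item j
        slope = _STD_NORMAL.pdf(-mj / sj) / sj   # linear coefficient of item j
        num += slope * (yj - p_pass)
        den += slope * slope
    return num / den


# Hypothetical two-item test, symmetric about the mean of the trait.
mu = [-0.5, 0.5]
sigma = [1.7, 1.7]
x_mixed = linear_score([1, 0], mu, sigma)      # passes easy, fails hard
x_all_pass = linear_score([1, 1], mu, sigma)   # finite, unlike the ML estimate
x_all_fail = linear_score([0, 0], mu, sigma)
```

With symmetric items, passing the easy item and failing the hard one gives
an estimate at the mean of the distribution, and the all-pass and all-fail
patterns give finite estimates of opposite sign.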
A Perspective on the Seminar



Donald Spearritt 



It was recognized in the planning of the invitational seminar that the 
authors of the invited papers would examine the contributions and 
potentialities of latent trait measurement procedures at a number of 
levels. Some papers would emphasize the place of latent trait procedures
within the general stream of the theory and practice of measurement in 
education and psychology. Others would emphasize theoretical issues in 
latent trait measurement which have arisen in the course of finding solu- 
tions to practical problems. Some would be more directly concerned with 
demonstrating practical applications of latent trait models in large-scale 
educational testing and in particular areas of long-standing interest in the 
development of tests in education and psychology. With the expected 
diversity in the papers, it was anticipated that it would be a useful exer- 
cise to draw together the main themes of the seminar and to assess its 
contribution to the improvement of measurement in education and 
psychology. This task fell to the chairman of the seminar. The approach 
taken has been to consider first the trends raised in Thorndike's opening 
address, and then to reflect on the main themes. 

In his introductory address, Thorndike provided the seminar par- 
ticipants with an excellent overview of the origins and broad trends of 
psychometric theory and practice over the past 75 years. His references 
to Binet and Simon and Spearman are a reminder that the notion of la- 
tent traits has been in use for a long time, though the conception of latent 
traits or underlying abilities in the Spearman model is rather different
from that coming under the rubric of latent trait theory today. He
indicates why and how two of the main streams of psychometric theory
were developed — the theory of measurement error with its notions of 
true score and reliability, and the theory of the organization of human 
abilities, both deriving in large measure from Spearman's early work. 
Lest we hasten to demolish the old temples too quickly, he gives us a 
timely reminder that the classical measurement model was responsible 
for producing a body of useful knowledge about tests. He also notes the 
emergence of an alternative notion of true score in more recent times, in
which true score represents the universe of behaviour that would be
approached if the number of relevant test tasks were increased without
limit. But neither this notion nor the components-of-variance model to
which it leads received much attention in the seminar.

The pervasive influence of multiple-choice items on testing theory and
practice is also noted, as is the development of indices for item analysis
purposes, viz. item difficulties or facilities, and item discrimination
indices. Despite their shortcomings, these indices have influenced
educational and psychological measurement at the workface as well as in
the research laboratory.

It is enlightening to have had Thorndike's perspective on what he sees
as the two major competing models to interpret a test score, that is, the
domain sampling model and the latent trait model. While the seminar
was largely concerned with the latter, we are reminded that a very
substantial amount of work has been done in recent years on defining the
universe of behaviour to which we wish to generalize from our tests, on
criterion-referenced tests and on testing for mastery. Though the domain
sampling model is not without its problems, it represents an important
aspect of educational measurement which must continue to be explored.

Thorndike's concept of the latent trait model as a vertical dimension
on which the individual person is to be located is a useful one. There are,
as he notes, some real difficulties in conceptualizing some aspects of
educational achievement in terms of the latent trait model, though
setting up dimensions such as 'competence in history' may go some way
towards meeting these difficulties.

Latent trait models, the aspect of educational and psychological
measurement which forms the main subject of this conference, were
considered in the final section of Thorndike's paper. He distinguishes two
schools of thought: the 'one-parameter' school represented in the Rasch
approach, and the 'three-parameter' school led largely by Lord. The pros
and cons of these approaches were argued by Thorndike and debated in
greater detail in subsequent papers in the seminar.

In their own way, the individual papers have each made a contribution
to the theory and/or practice of latent trait measurement. But what has
been the contribution of the seminar to the field, considering the set of
papers in toto? One convenient way of making this assessment is to
formulate some fundamental questions and to see what light has been
thrown upon these by the various papers.

WHICH OF THE LATENT TRAIT MODELS GIVES THE BEST FIT TO ITEM DATA?

This has been a controversial question for some time. The contest has
been largely between the Rasch one-parameter model, involving item
difficulty only, and the three-parameter model, involving item
discrimination indices and guessing parameters also. As Lord (1970)
demonstrates, good fits appear to be obtainable with the three-parameter
model, which takes account of the differing slopes of item characteristic
curves and the level of the lower asymptote, as well as item difficulties.

On this criterion, the three-parameter model, not unexpectedly, has
the advantage. But it is relevant to ask, as Thorndike does, whether such
a complex model is really required. Are its advantages outweighed by the
need for substantial computing facilities and for large samples for item
tryouts? With respect to these questions, McDonald's study provides
some promising findings. His results suggest that precise estimates of
ability can be obtained with sample sizes as low as 500, which is about a
desirable minimum value of N for item analysis studies.



From a practical point of view, the Rasch model obviously has the
advantage of being simple to operate and requiring smaller tryout samples
of persons. But Thorndike indicates that the model provides only a
rough fit to the data in some cases, depending on the differences among
the values of item discrimination indices for a set of items and the extent
to which guessing is involved in students' selection of answers. He sees
the Rasch model as being rather more successful with constructed-response
items than with multiple-choice items, though with the qualification that
carefully selected multiple-choice items are likely to provide good fits.
Choppin recognizes that the model is designed to give an approximate
rather than an exact representation of data, but argues on the basis of
his extensive experience with the model in studies carried out by the
National Foundation for Educational Research in England and Wales that
the model is robust with respect to violations of its underlying
assumptions, and presents empirical evidence to support this argument.
Working in the area of cognitive development, however, in which ability
is likely to change over a period of time, Keats shows that a
one-parameter model is inadequate. A two-parameter model of cognitive
development which uses as individual difference parameters both the
asymptotic value of a person's ability and the rate at which he
approaches that level gives a rather better fit to the data than a
one-parameter model involving some index of IQ.

Choppin notes that one of the suggested applications of the Rasch
model involves the identification of responses which are 'lucky guesses',
and the editing out of the items which produce such responses. This type
of editing would have the effect of improving the fit to the one-parameter
model. It would seem highly desirable in such applications to seek some
independent verification of the statistical aberrations. Some interviewing
of students about their lucky guess responses would indicate whether the
editing was justified.

The question at issue concerns the robustness of the Rasch model.
How far can the data depart from the model before leading us to draw
incorrect inferences? Arguments about the relative virtues of the Rasch
model and the three-parameter logistic model are reminiscent of other
controversies over the last three decades concerning the violation of
assumptions underlying statistical tests, for example, the robustness of
the F-test in analysis of variance. In such controversies, it has often been
found that the models allow considerable relaxation in their underlying
assumptions before they begin to support false inferences. It would not
be surprising to find that the Rasch model exhibited a similar degree of
robustness.

There is a further question that may be asked in considering the
goodness of fit of the Rasch model. If better fits to a set of data are
possible, does this mean that less good fits must be discarded, even if they
take a fraction of the time to obtain and provide a satisfactory
approximation to the questions being asked of the data? While further empirical
studies will be of assistance in answering this question, it would seem
reasonable to draw the tentative conclusion that the Rasch model is a
satisfactory model for estimating item and person ability parameters,
unless its applicability to a set of items is obtained at the expense of
discarding too many items.

ARE EXISTING TESTS OF FIT FOR LATENT
TRAIT MODELS SATISFACTORY?
This question was taken up by Douglas and McDonald. Both authors
regard the existing chi-square tests of fit as unsatisfactory. They are more
likely to yield a significant non-fit with increase in sample size.
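
The dependence on sample size can be seen in a small sketch: if the
observed proportions differ from the model's expected proportions by a
fixed amount, the Pearson chi-square statistic grows in direct proportion
to the number of persons. The proportions below are invented for
illustration and do not come from either paper.

```python
def pearson_chi_square(observed_props, expected_props, n):
    """Pearson chi-square for n persons spread over response groups,
    given observed and model-expected proportions in each group."""
    total = 0.0
    for o, e in zip(observed_props, expected_props):
        observed_count = o * n
        expected_count = e * n
        total += (observed_count - expected_count) ** 2 / expected_count
    return total

# Hypothetical, slightly misfitting item: the same proportional misfit
# is assumed at every sample size.
observed = [0.22, 0.50, 0.28]
expected = [0.25, 0.50, 0.25]

for n in (100, 500, 5000):
    print(n, round(pearson_chi_square(observed, expected, n), 2))
```

The misfit that passes unnoticed with 100 persons is declared significant
with 5000, although the discrepancy between data and model is no larger.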

Douglas urges that the present approximate tests of fit be used with
caution. One of the advantages of the generic Rasch model which he
derives in his paper by means of conditional inference approaches is that
it leads to a test of fit based on likelihood functions, though the
applicability of the test is limited by numerical analysis problems. Douglas
notes that an exact test of fit of data to a Rasch model is theoretically
possible, through the use of conditional inference procedures free of all
parameters in the model. Such a test has still to be developed, but he sees
a promising line of development in the approaches taken by Agresti.

McDonald's paper is largely concerned with improving methods of
fitting, and testing the fit of, latent trait models. He notes that difficulties
such as lack of convergence have been experienced with programs for
fitting latent trait models which involve the simultaneous estimation of
item parameters and person parameters and that the difficulties have
been magnified if guessing parameters have also been estimated. He
doubts that the Rasch model is free of these difficulties. He questions the
appropriateness of the chi-square approach, since no account is taken of
the size of the residuals. He concludes that satisfactory criteria for testing
the fit of latent trait models have been lacking, and notes that, even when
a prior test of unidimensionality of the data has been made, it has relied
on a linear factor analysis of a non-linear set of data.

McDonald's use of non-linear factor analysis is a promising approach
to the fitting of latent trait models, since it avoids the false assumption of
linear item characteristic curves. He shows that the normal ogive model,
for example, can be expressed through orthogonal polynomials as a
linear combination of nonlinear functions of the latent trait. Constraints
on the model can be introduced into the program (COSAN) used for
fitting the model, a particular set of constraints providing the equivalent
of the Rasch model. He demonstrates that the program provides a
satisfactory fit to the normal ogive model in a one-, two-, or
three-parameter form for a range of data sets. Since this approach requires no
prior test of the dimensionality of the data, and takes account of the size
of the residuals in assessing the goodness of fit, there is some support for
McDonald's claim that it is superior to the usual methods of testing fit.
There remains the problem of determining whether his approach is
robust with respect to violation of the assumption of normality in the
distribution of the latent trait.

There is obviously scope for improvement in testing the fit of latent
trait models, and the rather different approaches of Douglas and
McDonald to the problem provide useful directions for further
investigation.

HOW EFFECTIVE ARE CONDITIONAL
AND UNCONDITIONAL PROCEDURES FOR THE
ESTIMATION OF ITEM PARAMETERS?
The 1970s have been marked by a considerable amount of interest in the
development of unconditional and conditional maximum likelihood
procedures for the estimation of item parameters. Unconditional procedures
which involve the simultaneous estimation of both item and person
ability parameters have the disadvantage of yielding inconsistent
estimates of item parameters. Conditional procedures yield estimates of
item parameters conditional on person ability parameters and are
commonly accepted as possessing theoretical advantages, especially with
respect to the testing of goodness of fit. The seemingly intractable
numerical problems of estimation with conditional procedures led
investigators to make some progress with the refinement of unconditional
estimation procedures.

As mentioned in the previous section, McDonald questions the
acceptability of procedures which involve the simultaneous estimation of item
and person ability parameters, on grounds which include the uncertainty
of convergence. The difficulties are exacerbated by the inclusion of
guessing parameters, unless a wide range of abilities is involved.

In presenting his generic Rasch model, Douglas provides a more
generalized framework within which estimation procedures for item
parameters can be considered. By determining how many data sets could
have produced the observed marginal totals for all parameters, he can
estimate the conditional likelihood of the observed data set given the
observed marginal totals. This enables him to focus on any designated set
of parameters, say, item difficulty, and to estimate the conditional
likelihood for the set of item marginals given the set of raw scores, thus
allowing him to arrive at item parameter estimates which are not
dependent on subject ability parameters. A number of numerical analysis
problems have to be solved, however, before these conditional maximum
likelihood estimation procedures can be put into operation. He
recognizes that Gustafsson (1980) has recently developed conditional
estimation procedures which can be successfully applied in the Rasch
model for up to 100 dichotomous test items, but has some reservations
about the effects of extreme item parameters and rounding errors on the
estimates yielded by these procedures. Pending further work on
conditional estimation procedures, he recommends the use of unconditional
estimates of parameters, which can be subsequently 'corrected' to the
corresponding conditional estimates, though the extent of applicability
of such corrections has also to be explored.
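
The idea of conditioning on raw scores can be illustrated for the
dichotomous Rasch model. In the sketch below (an illustration under
assumed item difficulties, not Douglas's or Gustafsson's algorithm), the
probability of a response pattern given its raw score is the pattern's
own term divided by an elementary symmetric function of the item
easiness values, so the person's ability cancels out.

```python
import math
from itertools import product

def pattern_prob(pattern, theta, difficulties):
    """Unconditional Rasch probability of a 0/1 response pattern."""
    prob = 1.0
    for x, b in zip(pattern, difficulties):
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        prob *= p if x == 1 else 1.0 - p
    return prob

def elementary_symmetric(eps):
    """gamma[r] = sum, over all patterns with raw score r, of the product
    of eps_i for the items answered correctly."""
    gamma = [1.0] + [0.0] * len(eps)
    for e in eps:
        for r in range(len(eps), 0, -1):
            gamma[r] += e * gamma[r - 1]
    return gamma

def conditional_prob(pattern, difficulties):
    """P(pattern | raw score), which involves no ability parameter."""
    eps = [math.exp(-b) for b in difficulties]
    numerator = 1.0
    for x, e in zip(pattern, eps):
        if x == 1:
            numerator *= e
    return numerator / elementary_symmetric(eps)[sum(pattern)]

# Hypothetical difficulties for a four-item test.
difficulties = [-1.0, -0.3, 0.4, 1.2]
pattern = (1, 1, 0, 0)
same_score = [p for p in product((0, 1), repeat=4) if sum(p) == 2]

for theta in (-1.0, 0.0, 2.0):
    # Brute-force check: condition on raw score at a given ability.
    direct = pattern_prob(pattern, theta, difficulties) / sum(
        pattern_prob(p, theta, difficulties) for p in same_score)
    print(theta, round(direct, 6),
          round(conditional_prob(pattern, difficulties), 6))
```

The brute-force value is the same at every ability level and agrees with
the closed form. The numerical difficulty mentioned in the text arises
because the symmetric functions become hard to compute accurately as the
number of items grows.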

Douglas's paper is an important one with respect to the estimation of
both item parameters and person parameters, and opens up new avenues
for investigation in this technically complex aspect of latent trait
measurement.



CAN LATENT TRAIT MODELS COPE
WITH THE FACT THAT ABILITY PARAMETERS
CHANGE AS A RESULT OF
INSTRUCTION AND OVER TIME?

Keats points out that ability is a trait which is likely to change with time,
and that the ability parameter being estimated through person
parameters in the Rasch model and some other latent trait models is
time-bound. Taking cognitive development as an area in which ability
can be expected to change, and using in turn both ratio IQs and deviation
IQs as the individual differences parameter, he finds that the
one-parameter model fails to account satisfactorily for cognitive
development. Rather better results were obtained with a two-parameter model of
cognitive development, which incorporates as individual differences
parameters both the asymptotic value of a person's ability and the rate at
which he approaches that level. While this approach provides a
procedure for accounting for a change in ability, it was seen to involve some
difficulties by some of the seminar participants. Reliable information
about both the asymptotic value of ability and the rate of development
may be difficult to obtain for a significant proportion of persons prior to
adulthood. More fundamental was the question of whether latent trait
models should be expected to cope with changing ability. Should the
ability parameter be an estimate of ability at the time of measurement
only, or an estimate which also took into account the likely ultimate level
of development of that ability?

The use of latent trait models in the measurement of change was
considered also by Spada. In their paper, Spada and May set out the
rationale of, and some practical applications of, the Linear Logistic Test
Model (LLTM), which was developed during the 1970s by a number of
European latent trait theorists to overcome the problem involved in
measuring change. In effect, the problem is handled by breaking down
the usual item parameters into linear combinations of the operations
involved in finding the solution to the item; these include not only cognitive
operations but components such as the effect of different types of
instruction. Whereas change in item difficulty with time or instruction is difficult
to represent in the Rasch model, it can be adequately represented in a
model such as the LLTM which analyses the difficulty of each component
operation. Operation difficulty parameters can be used to arrive at
estimates of item difficulty parameters.
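
The decomposition can be sketched as a weighted sum: each item's
difficulty is built up from the difficulties of the operations it
requires. The operation names, weights and values below are hypothetical,
chosen only to show the arithmetic, and a normalizing constant is
omitted.

```python
import math

# Hypothetical difficulty (in logits) of each component operation,
# including a negative "instruction" effect that makes items easier.
operation_difficulty = {"negation": 0.8, "conjunction": 0.5, "instruction": -0.6}

# Hypothetical weights: how often each operation is needed for each item.
item_weights = {
    "item_1": {"negation": 1, "conjunction": 0, "instruction": 0},
    "item_2": {"negation": 1, "conjunction": 2, "instruction": 0},
    "item_2_after_teaching": {"negation": 1, "conjunction": 2, "instruction": 1},
}

def lltm_item_difficulty(weights, operations):
    """Item difficulty as a linear combination of operation difficulties."""
    return sum(weights[name] * operations[name] for name in operations)

def prob_correct(theta, difficulty):
    """Rasch probability of success given ability and item difficulty."""
    return 1.0 / (1.0 + math.exp(-(theta - difficulty)))

for item, weights in item_weights.items():
    b = lltm_item_difficulty(weights, operation_difficulty)
    print(item, round(b, 2), round(prob_correct(0.5, b), 3))
```

In this form the effect of instruction appears as one more component
added to the item's difficulty, which is how change can be represented
without re-estimating every item separately.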

Spada and May argue significantly that the structure of a task or item
is not likely to be the same for all persons in a sample, as is assumed by
both the LLTM and the Rasch models. Intellectual development does not
necessarily take the form of increasing mastery of the same solution
algorithm, but may be characterized by the appearance of different
solution algorithms. By allowing change to be examined at a basal level, the
LLTM provides more possibilities for coping with structural change than
does the Rasch model. It has distinct advantages for the evaluation of
factors contributing to change in item difficulties, and considerable
potential as an approach to the study of change. As McDonald noted
during the seminar, the significance of this procedure lies more in its new
approach to the modelling of change than to the measurement of change.

HOW CAN LATENT TRAIT THEORY IMPROVE
TEST DEVELOPMENT PROCEDURES?
One of the main reasons for singling out latent trait theory as the major
focus of this seminar was the gap between the theoretical advances in test
theory and the procedures being applied at the practical level in test
development, which by and large have taken little or no cognizance of
these advances. Hence it was highly appropriate that some attention be
given in the seminar to practical applications of latent trait concepts in
test development.

A commonly claimed advantage of latent trait models is that they
facilitate the equating of scores across tests, because of the availability of
sample-free item parameters and item-free person parameters. This
particular application of the Rasch Simple Logistic model was examined by
Morgan in relation to the equating of different trial and final forms of the
Australian Scholastic Aptitude Test (ASAT), through the use of link
items at both the whole-test level and sub-test levels. The procedures
generally followed the Rasch common item method of equating tests as
set out in Wright and Stone (1979).
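
In outline, the common item method places two separately calibrated
forms on one scale by shifting one form by the average difference in the
difficulties of the link items. The sketch below shows only that
translation step, with invented difficulty values; it is not a
reproduction of Morgan's analysis or of the full procedure in Wright and
Stone.

```python
def link_constant(link_on_form_a, link_on_form_b):
    """Mean difference in difficulty of the link items as calibrated
    on form A and on form B (both in logits)."""
    diffs = [a - b for a, b in zip(link_on_form_a, link_on_form_b)]
    return sum(diffs) / len(diffs)

def to_form_a_scale(difficulties_b, shift):
    """Translate form B item difficulties on to the form A scale."""
    return [b + shift for b in difficulties_b]

# Hypothetical calibrations of the same three link items in each form.
link_a = [-0.50, 0.25, 1.00]
link_b = [-0.90, -0.20, 0.65]

shift = link_constant(link_a, link_b)
form_b_items = [-1.20, 0.10, 0.80]
print(round(shift, 3), [round(d, 3) for d in to_form_a_scale(form_b_items, shift)])
```

A link item whose own difference departs markedly from the average shift
would be a candidate for removal from the link, which is why an allowance
for lost link items is needed.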

A number of findings from this study are likely to be of general interest
to test constructors. It was found that items which did not conform to
the Rasch model were largely those with very high or very low item
discrimination indices. The percentage of ASAT items conforming to the
Rasch model was greater when items were calibrated within their
respective sub-tests rather than across the whole test, presumably because of a
greater degree of unidimensionality within the separate areas. In this type
of equating exercise, test constructors should make sufficient allowance
for the loss of link items which do not fit the Rasch models. The effect of
the positioning of link items within a test also seems worthy of further
study.

Morgan's study demonstrates that it is possible to use Rasch models to
equate factorially complex scholastic aptitude tests at both the whole-test
and sub-test levels. It would be useful to ascertain what was happening to
the item pool in the process. Is the factor composition of the finally
selected items less complex than that of the original pool? Morgan's
suggestion of classifying a set of test items into homogeneous sub-tests
before applying the Rasch model seems sensible, and akin to the old
question of whether to use total score or verbal and quantitative
sub-scores as the criterion against which to analyse individual test items.

Izard and White's paper is an attempt to make latent trait analysis
procedures accessible to classroom teachers, a development which must
occur if latent trait models are to have a significant impact on
educational testing as distinct from educational and psychological
measurement in research. In the classical measurement tradition, booklets such
as Educational Testing Service's Making the classroom test or Multiple
choice questions: A close look have had a major influence in spreading
ideas about item analysis procedures and notions of reliability and
validity among teachers. These ideas have become more readily available
to Australian teachers in recent years through the publication by state
education departments or examining bodies of series of booklets under
such titles as 'School Assessment Procedures'.

Izard and White describe the use of Rasch procedures to develop a
pool of calibrated items for use by teachers. They distinguish between
progress tests consisting of small numbers of items to indicate the degree
of mastery of specific skills, and review tests, a large collection of items
designed to give a broader coverage of a student's performance in a
content area. Their example, using a uniform test with items evenly spaced
across the difficulty range, suggests that tests with small numbers of items
can be satisfactorily prepared for progress tests to allow a student to be
regarded as having mastered a skill if he scores 5 or 4 on a five-item test,
or having not mastered it if he scores 0 or 1. While their procedure for
developing progress tests depends on applying the characteristics of a
narrow test, they use the size of the standard error of measurement as a
criterion to determine the appropriate length of review tests.
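
The link between test length and the standard error of measurement can
be sketched for the Rasch model, where the standard error of an ability
estimate is the reciprocal square root of the test information, that is,
of the sum of p(1 - p) over the items. The evenly spaced difficulties
below are assumed for illustration and are not Izard and White's data.

```python
import math

def uniform_test(n_items, width):
    """Item difficulties evenly spaced over `width` logits, centred on 0."""
    if n_items == 1:
        return [0.0]
    step = width / (n_items - 1)
    return [-width / 2 + i * step for i in range(n_items)]

def standard_error(theta, difficulties):
    """Rasch standard error of measurement at ability theta:
    1 / sqrt(sum of p * (1 - p)) over the items in the test."""
    information = 0.0
    for b in difficulties:
        p = 1.0 / (1.0 + math.exp(-(theta - b)))
        information += p * (1.0 - p)
    return 1.0 / math.sqrt(information)

# Hypothetical tests of increasing length, all 4 logits wide.
for n_items in (5, 10, 20, 40):
    se = standard_error(0.0, uniform_test(n_items, width=4.0))
    print(n_items, round(se, 2))
```

A target standard error can then be turned into a required test length
by increasing the number of items until the computed value falls below
the target.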

Once banks of calibrated items are prepared, their use by teachers in
constructing classroom tests will depend very much on whether simplified
and easily understood item calibration procedures are available. The
simplified PROX procedure from Best Test Design (Wright and Stone,
1979) and the method of calibrating teacher-made items on to the item
bank scale by using link items selected from the latter scale are illustrated
by Izard and White, and would seem to have a reasonable chance of
implementation by teachers who are prepared to put this extra effort into
their assessment practices.

The use of worksheets of the type suggested by Izard and White will be
essential if the Rasch procedures are to be applied by teachers in their
own tests. The task of making clear to teachers the assumptions and
concepts of Rasch measurement is likely to be more difficult, but should be
aided by manuals such as Best test design (Wright and Stone, 1979)
which provides exceptionally clear and simple presentations of basic
concepts of measurement.

The work done by Izard and White in the development of short
progress tests exemplifies Thorndike's argument that it is in the areas of
individualized and adaptive testing that latent trait models, which are well
suited to the estimation of a subject's precise location on a trait
dimension, will have a considerable impact on test development procedures.

Item banking is an area in which latent trait theory could make a
major contribution to measurement, again because of its sample-free
item parameters. The value of item banks is enhanced by the availability
of invariant item difficulty parameters. The importance of item banking
is underlined in Choppin's paper, both with respect to national
monitoring programs and its general application for use in schools. Some
reservations were expressed at the seminar as to whether the Rasch model
was sufficiently robust as a base for monitoring programs, and
particularly about the effects of the proposed multi-chaining procedures on
item parameters, which might be expected to become considerably less
stable with moves away from centralized curricula to school-based
curriculum development. There is little doubt that latent-trait-based item
banks will be of assistance to schools in their assessment procedures.
Their use in monitoring programs is somewhat more controversial,
although it is difficult to determine the mix of technical measurement
problems and the educational and political overtones in such
controversies. Theoretical questions about the viability of latent trait models for
such purposes will need to be considered in the light of the extensive
practical experience that the National Foundation for Educational
Research in England and Wales has acquired in the use of such models.

Choppin also describes some novel applications of latent trait models
to particular practical problems in educational testing, such as the
determination of between-marker agreement and the handling of score
matrices with incomplete observations.

WHAT WERE THE MAJOR CONTRIBUTIONS
OF THE SEMINAR TO MEASUREMENT THEORY?
The reader may have gained the impression from the previous pages that
widespread consensus of views was the order of the day. Such was not
the case. A number of participants, and especially McDonald, felt that
the Rasch and other latent trait modellists may be in danger of cutting
themselves off from other areas of psychometric theory, and warned
against a complete rejection of older models in the search for new
models. Despite the differences in orientation on this issue, it is apparent
that some important contributions towards the unification of test theory
were made in the seminar, particularly by Andrich, Douglas, and Keats.

Andrich has made an undoubted contribution to bridging the gap
between older and more recent models of measurement by showing that the
Rasch latent trait model synthesizes the Thurstone and Likert
approaches to attitude measurement. He perceptively observed that
Thurstone was searching for an attitude measure which was invariant
across different groups of persons, and a person measure which was
invariant across different sets of statements, whereas Likert was not. He
expresses Thurstone-type scale values in the form of the simple logistic
model for dichotomous-response rating scales, and derives a 'rating
response model' for the ordered polychotomous response case by taking
the threshold and statement values as additive components in the model,
and bringing together the results obtained from considering each
threshold independently. In effect, the Thurstone and Likert approaches
are considered as special cases of the dichotomous and polychotomous
form respectively of the ordered response model. This development of a
Rasch rating model represents both a theoretical and a practical
contribution to the improvement of measurement of attitudes.
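
A rating model of this kind can be sketched in the form usually written
for ordered categories: the probability of each category depends on the
person's attitude, the statement's scale value, and a set of thresholds
shared by all statements, all entering additively. The numerical values
are invented, and the function is an illustration rather than a
transcription of Andrich's derivation.

```python
import math

def rating_probabilities(attitude, statement_value, thresholds):
    """Probabilities of response categories 0..m, where the log-odds of
    category x over category x-1 is
    attitude - statement_value - thresholds[x-1]."""
    cumulative = [0.0]
    for tau in thresholds:
        cumulative.append(cumulative[-1] + attitude - statement_value - tau)
    weights = [math.exp(c) for c in cumulative]
    total = sum(weights)
    return [w / total for w in weights]

# Hypothetical four-category (Likert-style) statement with three thresholds.
probs = rating_probabilities(0.5, 0.0, [-1.0, 0.0, 1.0])
print([round(p, 3) for p in probs])

# With a single threshold at zero the model is the simple logistic
# (dichotomous Rasch) model.
print(round(rating_probabilities(0.5, 0.0, [0.0])[1], 3))
```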

Douglas in turn has made a substantial theoretical contribution in
generalizing the theory behind the Rasch model to incorporate variants
of the model, including Andrich's model for polychotomous attitude
scale items and the Rasch/Andrich essay grading model. Douglas's
generic model reduces to the standard Rasch approach in the case of
dichotomously scored items.

Keats was the only author to make a direct examination of the
relationship between classical test theory and latent trait theory. He arrived
at the important generalization that the true scores in classical test theory
show an explicit relationship with latent ability values only when all items
have identical item characteristic curves. This condition might be
regarded as unnecessarily restrictive by test developers, although there
would be practical advantages in having tests of equivalent items
available at a number of different age levels.
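
The role of identical item characteristic curves can be illustrated
under an assumed Rasch-type curve: the true score is the sum of the item
probabilities, and when every item has the same curve it reduces to n
times a single probability, which can be inverted in closed form to
recover the latent ability. This is a sketch of the general idea with
invented values, not Keats's own derivation.

```python
import math

def true_score(theta, difficulties):
    """Expected number-correct score: the sum of the item probabilities."""
    return sum(1.0 / (1.0 + math.exp(-(theta - b))) for b in difficulties)

def ability_from_true_score(score, n_items, b):
    """Closed-form inverse, valid only when all n_items share the same
    logistic curve with difficulty b: theta = b + ln(T / (n - T))."""
    return b + math.log(score / (n_items - score))

# Hypothetical test of ten equivalent items, each with difficulty 0.3.
equivalent_items = [0.3] * 10
t = true_score(1.0, equivalent_items)
print(round(t, 3), round(ability_from_true_score(t, 10, 0.3), 3))
```

When the items differ in difficulty, the true score is still a monotonic
function of ability, but no such simple inverse is available and the
relationship has to be found numerically.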

Given that these three authors were successful in achieving some
further integration of measurement theory, there was nevertheless a
feeling on the part of some of the seminar participants that latent trait
theorists were failing to take sufficient account of the mainstream of
measurement theory. This point of view was resisted by some of the
latent trait theorists, who thought that premature attempts to integrate
differing theories might obscure the special features of new theories.
McDonald's plea for a more concerted effort on the part of latent trait
theorists to consider their models in the context of other aspects of
measurement theory is worth heeding. It would seem incongruous, for
instance, not to expect some correspondence between the latent trait
measures yielded by factor analysis of item data and those estimated
through latent trait models.

WHAT CHANGES IN MEASUREMENT PRACTICE
ARE LIKELY TO RESULT FROM THE
INCREASING USE OF LATENT TRAIT MODELS?
It will be appropriate to complete this overview of the seminar with some
personal predictions of the changes which are likely to occur in
measurement practice in Australia as the result of an increasing use of latent trait
models. These will coincide to some extent with the predictions made in
Thorndike's paper, since some of the new approaches made possible by
latent trait models will be likely to be adopted irrespective of any
national differences in the philosophy or practice of measurement.

Except in the case of short 'progress' tests designed to estimate fairly
precisely a person's position on a dimension, there seems likely to be little
change in the types and characteristics of items selected for achievement
and ability tests because of the use of latent trait models. Most of the
items which are acceptable in terms of traditional item facility and
discrimination indices are likely to be acceptable in a latent trait model.
If experience indicates that some 'good' items under traditional indices
are rejected by latent trait models because of lack of fit, practitioners
may well show some inclination to question the model as well as the
items. In effect, tests of any reasonable length can be expected to have
similar distributions of item facility and discrimination indices, and
similar levels of reliability as they have at the present time.

Major changes can be expected in the provision of short progress tests
with uniform distribution properties of the kind described in the Izard
and White paper. Testing is likely to be used more for instructional
purposes and somewhat less for survey purposes, especially if computer
facilities become more readily available at the local level. This will
stimulate demand for individualized testing, for adaptive or tailored
tests, and the greater availability of such tests will in turn promote a
greater use of tests for instructional purposes. The fact that latent trait
models can be used to provide fairly precise item-free estimates of person
ability will probably lead to their widespread adoption in the
development of such tests. There may well be some development of fine-grained
tests in accordance with the linear logistic test model to assess whether
the individual operations required to answer an item have been mastered.

Latent trait models are also likely to improve the quality of item banks
and to generate more interest in their use. Teachers are likely to become
more accepting of item banks as an additional resource in their teaching
and testing, and more so if the items are accompanied by adequate
sample-free information about their parameters.

Some slackening of demand for norm-referenced tests can be
anticipated, though it is not likely to be very pronounced. Teachers are still
likely to be interested in comparing the performance of their students
with other appropriate reference groups, even if they have item-free
indices of their students' achievement in different subjects. Their reliance
on norm-referenced tests will be greater if it proves to be difficult to apply
latent trait models to achievement tests in some of the traditional content
areas. Norm-referenced tests are likely to retain their appeal also for
educational administrators and for psychologists involved in the
assessment of ability and aptitude.

A major obstacle in gaining widespread acceptance of latent trait
models will be the inherent complexity of the measurement notions on
which they rest. It takes years to wean the public and even teachers away
from the idea that marks can only be interpreted on a percentage scale.
There has obviously been some increase since the 1940s in the percentage
of the community with some understanding of the ideas of standard
deviation and percentiles, but these notions are still not understood
widely. Given the recent public controversies in Australia about the
conversion of raw scores to scaled scores, one must view with some
trepidation the public's likely degree of understanding of a student's score on an
underlying ability or achievement scale which does not range from 0 to
100. A great deal of effort will have to be expended in communicating the
meaning of the new score scales to teachers and the public. If these ideas
remain inexplicable to people at the level at which they are to be
implemented, their implementation is unlikely to be successful.

Although the prospects of an early widespread acceptance of latent
trait measures in the public educational domain seem dim, they are
probably much brighter in the areas of educational and psychological
research. Research can only benefit from the use of sample-free item
scores and item-free person scores. If the seminar was successful in
raising the level of understanding of latent trait models among researchers
and measurement specialists in Australia, it will have proved to be a
significant event in the improvement of measurement in education and
psychology in Australia, and a fitting event to mark the ACER's
achievements in educational and psychological measurement on the
occasion of its golden jubilee year.



Gustafsson, J. A solution of the conditional estimation problem for long tests in
the Rasch model for dichotomous items. Educational and Psychological